Pose-Invariant Face Recognition Using Real and
Virtual Views

by 

David James Beymer 
Abstract 

The problem of automatic face recognition is to visually identify a person in an input 
image. This task is performed by matching the input face against the faces of known people 
in a database of faces. Most existing work in face recognition has limited the scope of the 
problem, however, by dealing primarily with frontal views, neutral expressions, and fixed 
lighting conditions. To help generalize existing face recognition systems, we look at the 
problem of recognizing faces under a range of viewpoints. In particular, we consider two 
cases of this problem: (i) many example views are available of each person, and (ii) only one 
view is available per person, perhaps a driver's license or passport photograph. Ideally, we 
would like to address these two cases using a simple view-based approach, where a person 
is represented in the database by using a number of views on the viewing sphere. While the 
view-based approach is consistent with case (i), for case (ii) we need to augment the single 
real view of each person with synthetic views from other viewpoints, views we call "virtual 
views". Virtual views are generated using prior knowledge of face rotation, knowledge that 
is "learned" from images of prototype faces. This prior knowledge is used to effectively 
rotate in depth the single real view available of each person. In this thesis, I present 
the view-based face recognizer, techniques for synthesizing virtual views, and experimental 
results using real and virtual views in the recognizer. 

Thesis Supervisor: Dr. Tomaso Poggio 

Title: Uncas and Helen Whitaker Professor, Dept. of Brain and Cognitive Sciences 
Introduction



r • • „„ , rP able to effortlessly and instantaneously recognize 
Using cm - ^ - - „or, This ability enab,es as to perform 

rrJ^r- — e - - recogT; f « :r cars = 

machines exhibit intelligence. 

objects in a digitize .mage of a seen, Ob -t - ^e _ ^ 

a preliminary stage that may work, for example by g 

difficulty of the overall problem ^ ^ 

Why is object recogmt.on a d.fficc t problem J h gg P 
Another problem is scene clutter. If the image scene contains many objects, it may be
difficult to isolate regions of the image that contain a single object. If

:t2ab7e ;:r to ^r? e each ° thCT - th ™ ^ --^^ 

eyes are imperfect: real lenses tend to blur the image and the light measurement
process invariably adds noise to the image.

While different approaches exist to the problem of object recognition a simple 

appearance due to changes in pose and lighting are represented simply by storing
many different example 2D views of the object. When attempting to recognize a

zz ii'iiffi : rr r simply tries to match the ^ - —p> 

view that is sufficiently close in pose and lighting. Since pose-lighting parameter
ace ,s mult,d_al, populating this space densely enough with exLpleTil 
aPproZlT hUgen ™ ber f —P'- ^us, one important issneintbeyiLbZ 
The goal of this thesis is to explore the view-based approach in a real application

exT ' reC0S "' t,0n ' U " der difeent eX ' remeS in tCT ™ <> f «« of al I e 

example views per person. We will examine two cases:

1. many example views are available per person, and

2. only one example view is available. 

The application problem itself, chosen to be a good candidate for the view-based
approach, will be pose-invariant face recognition. By "pose-invariant", we mean that
the pose of the input view is free to vary within a certain range. Since the input
pose is allowed to vary, applying the view-based approach naturally requires
exampk v.ews c each person, views in this case talcen from dhW^^Z 
ve o h mP ™ ** abOVe ' Wha ' ab °"' ^ —J -Lario whe e 

Z nl eX ™ P t"" ? ASSUmi " g ' hat ™" is f ™ a fi-d pi 

' ' A first V ™^ '»"'««*■> ™P«' views at poses distant from pose 
our technique for synthesizing virtual views, their usefulness will be evaluated in a
view-based pose-invariant face recognizer.

In this thesis we make two main contributions. First, our view-based face recog-
nition work is the first to systematically explore the problem of pose-invariant face
recognition. As we will discuss in the prior work section,

developing techniques for synthesizing virtual views.

1.1 Object Recognition 

. • . i™.t;oliv identify instances of objects that 
The goal of object recognition ,s to That is, given prior 

the computer has been tra,ned or programme^ ' "J^ of a sce „e to analyze, 
hnowiedge of a set of known . an * ^1 "and estimate the orientation 

object recognition systems attemp to ^''^ of recognizing a telephone 

of objects in the scene. For ex^pk, «" ^^ps c uttered with other 
i„ an input image of an office desktop scene mage P ^ ^ ^ ^ 

common desktop objects such as pen , ^s etc. J rf ^ 

match a model of the telephone, perhaps a 3D CAD mode , 
image containing the telephone. recogn ition, an invariant features 

There are three major approaches o object^ '^f ' ^ differ . 

approach, a model-based approach, and a v rew-base d^ ^ ^ ^ 
ence among the three approaches , how they dealw^ .^.^ ^ 

a „ce across different image parameters such ^™fmage parameters change, 
tures approach finds object features that d n chan e- P ^ The 

sidestepping the problem of actually ^^' ^h ™!s 3 D object models to P re- 
model-based approach, the most popular ap roach use 3 
diet appearance under different .mage , parame • » rf ^ object . 

— t=r ;::;^d. u* — - 

Wd representation, where » the ^ ^ ^ ^ 



CHAPTER 1. INTRODUCTION 



grey levels, some representations transform the image using the KL or Pn ■ 
correspondence When recognizing objects in an input image oneofthefW . ■ 

objects, one also has to model the projection process from 3D m n 

so the overall mapping goes f rom , of, . f t ? ° ™ a « e > 



matrix 7? u« P transformed b y » standard 3x3 rotation 

—0: Ei^E^^^r*^ a - 

when compared to the distance be^ o^^ ^ ^ " ™ B 
model that needs only to map the image p
mathematical form is similar to equation (1.1)

$p' = sRp + t$,

or training stage and a matching stag. Th mo J the actU al on-line matehing 
that builds an object representat.on for 6 ^ ^ rf jjpwlellWknt 

stage. For the view-based approach th, stag ere y ^ ^ 

,„ the on-line matching stage, given an npnt toag p ^ 

Snd..^t.m^.^«r e, ^ d ^^ to 4e Stares and solving for 
the object. The matching stage invd^ » £ ^ ^ ^ how ftese 

1.1.1 Invariant features approach

arh the key is to find an object representation that 
In the invariant features approach, the key ^ The matchmg 

does not vary as image parameters f^ ^^ t J nvaliant presentation in 
pr ocess in the recognizer is then quit -J^^dd representation. This approach 

thns complicate the recognition P™** s . h uses the so - c alled cross ratio. 

A simple example of the invariant features appro ^ ^ ^ ^ ffi ^ 
Consider four collinear points A, c, f. ^ ^ invariaI1 t with respect 

hetween A «d B in the ^age. Then *e» t,o „ ^ deve]op geornetric 

to pose. Using methods from mvariant * », , V ^ ^ point 
invariants of a similar flavor for 3D planar J ^ [m] ^ a 
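
The invariance of the cross ratio mentioned above can be checked numerically. The following is a minimal sketch, not the thesis's own formulation: the ordering convention (AC * BD) / (AD * BC), the point coordinates, and the coefficients of the projective map are all assumptions chosen for illustration. The 1D projective map stands in for the change in image positions of four collinear points when the pose changes.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross ratio (AC * BD) / (AD * BC) of four collinear points.

    Each argument is the 1D coordinate of a point along the common line.
    The ordering convention is an assumption; other conventions exist.
    """
    a, b, c, d = (Fraction(v) for v in (a, b, c, d))
    return ((c - a) * (d - b)) / ((d - a) * (c - b))


def projective_map(x, p, q, r, s):
    """1D projective transformation x -> (p*x + q) / (r*x + s)."""
    return Fraction(p * x + q, r * x + s)


# Four collinear points, given by their positions along the line.
points = [0, 1, 3, 7]

# A hypothetical projective map standing in for a change of viewpoint.
mapped = [projective_map(x, 2, 1, 1, 5) for x in points]

before = cross_ratio(*points)
after = cross_ratio(*mapped)
print(before, after)
```

Exact rational arithmetic is used so that the equality of the two values is not blurred by floating-point rounding. Individual distances between the points change under the map, but the ratio of ratios does not.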

The color of an object is another mvanant eatu e ^ ^ ^ 

histogram of image color values to J**^TE c g ffort to detect faces under 
of object models prior to more detailed mem ^ ^ ^ ^ 

varying lighting conditions, «f ^ moufh, ^ etc . He shows empirically 
facial regions including the eyes, nose, cheex , 
that pairs of regions exist where the average brightness of

greater than that of another (e.g. cheeks TJ Th " * — -tently 

a face as a collection of snch L^X^" ^ * — ts 

tomlr:::^ 

Jacobs [4,1, Bnrns, W^^Z^ Sh °™ <« Semens and 

arbitrary set of p„i„ ts there „ ™"™» °» th " fo a 3D object consisting of „ 
modeled as a 3D clond of point Z^T^T ^ ^ ° b ^ ts » 

1.1.2 Model-based approach 

of the work in m odel-based m J;Z ^ "" dw pose. Most 

-ch as machined parts, ^ Z 'T ^ ^""^ ° b -' S 

the model-based approach are intensitv I t P0P " W re P rese "tation in 

corners, jnnctions, or edge n™^ Bv * " » " *» ™ b *s 

discontinnities, the repreLtZ' is X^' ^ 
based^ystems address the probJem of v^!„" ^.ng-contoohs; a> modei- 

^ZZ^zt^^ r odeUnd a 2D — «* 

bnilt from point featLs or edg i t L^T This"' ? 7 T™^' — »X 
centered coordinate system. Given aTrt" 1 m ° del " defi " ed il] M ° b j«t- 

pose that overlays the projected model onto th . ^ " and 
the model. While many LcK„2 ° T* fc,to " belOT »"« *° 

-heme of Hnttenlocher and U man 7 ? 1 11 / " 

correspondence and pose are hnked CL r ^ T defini,i °" of te ™ bow 
^odel-to-imagetransformaLsis^ 

a process that unfortnnately grows exponl ' n T COTres P°"oences, 
'ures. The ke, observation of VlZTZZ ^ T '° ° f 

featnres in forming corresponded s ^7 'fT ^'—oerof 
-nt correspondences are snmcient for ^ 
, „ t lhtee model features to three image 
omy has to enumerate al. -"^^L then enters a verification phase 
features ^ examine all possrWe ^ The Here ^ ^ 
w here it examines the j«;«« Station, using the hypothesized pose to 
returns to the Ml detail of the mod »^ / the „ perfor med in 2D between 
project al. features into the ,msge pta. eMat S Related systems include the 
the projected model features and the image ' , oca , featme focus 

Legation tree method ^^^^J b Lowe [S9,. 
method of Bolles and Cam [22,, and ^ ^overcor- 

While the previous approaches organ -££~J From this viewpoint, 

the matcher tries to find a pose that tangs al g rf ^ ^ (r(ms/ 

into correspondence with im* ^^Tta lntroduction) . A matfie- 

techniqne (see Ballard and Brown [10], Chapte 4 g fa ^ probfem 

of tended recognition ™ bounded function cen- 

sor means that projected mod 1 fatur la and 
tered around the image features. Th.s idea 
^orithmicallybyCass^andBreuelN. 



As previously discussed, the view-based a^ach to ^-cognitmn rnode,s » 
object appearance simply by storing Md foI the case of faces, 

views may be from a variety of V^^™ that one wis hes to handle in the 
exp^-^J^^S approach, it is often said that the 
input views. Compared with the 3D model 0 PF & o( 

vfew-based approach trades memory for coml^tatmn^ ^ match; 

views takes more memory than a single 3D h,e ^ ^ ^ ^ 

task is less computationally expensive since^ P ^ ^ y , cw . 

„ K consider for a moment t is trad„,T m i,er m ^ ^ _ than 

based approach is often consider^ rU> be* -nls of memory and 

3D model-based approach since our ton rf object recogn it,on, the 

rather slo „ computational speed, I . n£ ~ ach pra ctical in real 

cheapness of digital memory may make 
Figure 1-1: In scaled orthographic projection, the object in an object-centered
coordinate system is rotated, orthographically projected along the z-axis, and then
scaled. The view-based approach samples object views at different pairs of rotation
angles about the x- and y-axes.
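
The projection model in the caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names, the order of the two rotations (x-axis first, then y-axis), and applying the 2D translation after scaling are choices made here, not details confirmed by the text.

```python
import math


def rotation_xy(theta_x, theta_y):
    """Rotation about the x-axis followed by a rotation about the y-axis."""
    cx, sx = math.cos(theta_x), math.sin(theta_x)
    cy, sy = math.cos(theta_y), math.sin(theta_y)
    rot_x = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    rot_y = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    # Matrix product rot_y * rot_x, written out with plain lists.
    return [
        [sum(rot_y[i][k] * rot_x[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def scaled_orthographic(point, rotation, scale, translation=(0.0, 0.0)):
    """Rotate a 3D point, drop its z coordinate, then scale and shift in the image."""
    rotated = [sum(rotation[i][k] * point[k] for k in range(3)) for i in range(3)]
    return (
        scale * rotated[0] + translation[0],
        scale * rotated[1] + translation[1],
    )


# Sample one object point at a small grid of rotation-angle pairs,
# mimicking how example views are taken at different poses.
angles = [math.radians(a) for a in (-30, 0, 30)]
corner = (1.0, 1.0, 1.0)
views = {
    (ax, ay): scaled_orthographic(corner, rotation_xy(ax, ay), scale=2.0)
    for ax in angles
    for ay in angles
}
```

Only the two rotation angles are sampled here; translation, scale, and image-plane rotation act as planar transformations of the image and do not need example views of their own.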

Application to pose problem 

of the object to capture chaL ,„ a T ' " '° eX ™P k vi <™ 

back to our discussion o scat orZ " ^ ^ R *™* 

freedom to pose, two traus.atio a J fetor 'Trf"' * **- - 

need to worry about all six of the A ^ r °' atio,1S - Do ™ dually 

«- two trauslatious, acJe 1"^^ ^'""^ f °" P— rs 

not change image appearance i„ ^ XiS) ' "° 

an object, we can mode] changes in 2D translate , t P ' ^ M imi « e of 
2D. In general, we can model chan j 1 T T"'? by the ™ge in 

by applying a planar tra„ sf o™T a S F"^" *°* 

example views along those pos. dimeusiofs ' " ' S "° '° tak <= 

pose parameters cha ge ot^t^c *" ^ ^ (M These 

a -P. planar translj „ Z^^?™"*^ 
i • f ^ of keeping it fixed on the z-axis, these 
rotates the camera around this sptere ms e ^ ^ ^ ^ 

two rotational peters descnb « whe »- P spkere ^ tne de l 
he „ce this imaginary sphere » often ^ ^ ^ poses from 
acquisition phase consists of obtammg a set o 

the viewing sphere. h view-based approach typ.cally 

During the recognition of a new mput v,e ^ ^ ^ 

cycles through the examplev,ews, coming * ^ fte 

/ 2 D matching algorithm. By gomg rs , th e two translates, 
Mg ,es in depth are exammed. The remam ngp ffi algorlthn , 

scale, and image-plane rotat.on, 2D case of the model-based matchmg 

The 2D matching step can be seen as a spec ^ coriesp0I , dence „ r 

approach, consisting of ^f rfra f^ lteX the descriptiveness of the features, 
pose space. The amount of sear ^ features SU ch as points and hnes, 

tf the matching procedure ,s dnven us ng g modeWmage correspondences. 

roflwcomponen^ofa^ 

amp ,e views are chosen. or po Lns of it at regular interval 

sim pler techniques is to rf^^Z, sphere b, imposing a 6x6 gr,d on the 6 
Goad [601 samples 216 pomts on * « " cube onto the vi ew,ng sphere, 
faces of a unit cube and then rad ally pr« ^ fc example vi r e 

The object domain of Goad's sy* n 1 ^ of Breuel [23) , toy 

represented using 2D edges. In the v,ew b y ^ ^ ^ 3, 

Jo airplanes axe represented by A 2D edge . b ased representation 

ima ges were used for one plane and ^ ^ „ Breuers RAST algonthm, 

is again used, and the 2D mate l p «ve subdivision technique, 
an algorithm that searches pose p«e uang ^ a „ 
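
The cube-based sampling attributed to Goad above (a 6x6 grid on each of the 6 faces of a cube, projected radially onto the viewing sphere, giving 216 points) can be sketched as follows. This is a hedged illustration: placing the cube at the origin with faces at coordinate ±1 and taking grid points at cell centres are assumptions, as is the function name.

```python
import math


def cube_sphere_samples(grid=6):
    """Viewing directions from a grid on each cube face, projected onto the unit sphere."""
    # Cell centres of a grid-by-grid partition of [-1, 1] (an assumed placement).
    offsets = [-1.0 + (2 * i + 1) / grid for i in range(grid)]
    directions = []
    for axis in range(3):           # coordinate held fixed on this face
        for sign in (-1.0, 1.0):    # the two opposite faces along that axis
            for u in offsets:
                for v in offsets:
                    point = [u, v]
                    point.insert(axis, sign)
                    # Radial projection: rescale the cube point to unit length.
                    norm = math.sqrt(sum(c * c for c in point))
                    directions.append(tuple(c / norm for c in point))
    return directions


views = cube_sphere_samples()
print(len(views))  # 6 faces * 6 * 6 grid points = 216
```

Sampling a cube and projecting it is simple to implement, but the resulting points are not uniformly spaced on the sphere: directions near cube corners end up closer together than those near face centres.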

One of the problems of the ^T£* ^ ere too finely may result m h,gh 
ate number of examples. S "^V«*«* ri ' , * ,, f 
slor age costs, wh 3. s-plmg - j perf ^ attempts to find a compact 

, a: 5 :;:"" i ii* - - • An 
aspect, and a „ 15m( ^UMt^TJ^f,**™' crea *' n K a new 

a collection of distinct aspects of an J T , L " MP<iCt graph ' the "°<*es are 
* the aspects are connected b a v^ tT T T ^ *" »*• 
» developed using 3D solid shape and ^ ^' J 116 the ° r * ^hind aspect graphs 

edges and jnnctions. g6S ' hat occur "> 'he configuration of 

tr g ; f f : p ris: iiix creati ; g the set ° f Mpects ^ *« 

Chakravarty and Freeman [37] 'wo nfwZT" ' S ' ' 

pects cWucferfsfe ni eM , ' d A ' i f , , ran,eW ° rk bu ' Ca "™8 as- 

characteristic views from a 3D object modi/ft TT" 1 ' * Sd <* 

cally computing the aspect. „,„k V_ . ' ..T* has focused 

on automati- 

compute the aspect graph by first « t ' ™ d Ka "^ PU 

catering similar vietslto aTp ts ™ o «v" S rt T """^ " ^ " d 
by the fineness of the initial sarnpl „g. f„T I "V'" 5 ' 6 ™ 8 appro « h * "mited 
sphere could stil, bemissed. E« I^T' W " h T" "H"* « 'he viewing 
wort by mapping visuaJ eventsTut o ^ vtl" "* ^ **** 

polyhedral objects; Ponce and Krifgman [ml t h " '° ^ "" h 

parametric surfaces. It should be added tha i . " J**"** C ° nS ' r " Cted from 
^ition strategies using their £ ^l^f 1 - ^ ^ 

acquisition phase. Howevejr ^ nlr comb Laf rf ^»<^i» the.Lw 
isolated examples, the example lews Z e t! ^ ™" - 
object. That is, Ullman and Lsr h ow h 't t tma " " 6W ™ WS ° f «■« 
as a linear combination of n J le v w " olT" " " °, ^ <» «*- 
for objects with "sharp edges" such as polvl d , k ™ WS m "^eded 

this result), with more views beinTrelirtd f ' [105] also showed 

are allowed. The 2D view represent! 1T S S K " ar ' ku,ated 

contourfeatures. The technique T^the feat" k"-"™ ^ °" >** ™* 
both the example views and new Wew b Z le A " across 

-an and Basri assume is externally supp, 7 7^Z ' " C< — ™- 
projection model; see Shashua [1211 fe yS ' S aSSUmes m orthographic 

combinations appro^h to the perspect™ ^ *** °" 

Overa,,, the linear combinations tech„i q ue shows that a set of 2D views „ eq , v . 
compute 3D stmcture. Mem of objec t lecogni- 

tio n, can be read - a ^^.^f^ learn how to perform the recognr- 
approaches, the basic ,dea ,s to hav com? Th , s appIoa ch 

f,on task rather than having to explicitly cod 8 ^ poggio [26] . 

L recently been explored by Poggu, nd Ed ^ ^ , earning tec h„ iq oes try 
Giv en a set of example 2D views o an oh^e £ ^ ^ rf 

to construct a mapping from the space c, I ^ . g froro the ob t , 

vhere that property may he ar -tor viable ^ ^ 

0 otherwise) or an estimate of 3D o eet pose ^ ^ ^ y 

typ e from a sparse set of examples » po s Me ^ t0 

sloth (be. the appearance of an object * , ^ ^ ^ ^ v , ews 
the linear combinations approach, feature 

is turned. f the m ap P ing is chosen, along 

,„ this technique, first a math mat, ah form ^ ^ rf & 

wi th „ associated architecture tha ^ g pr ocedure, which » 
of simple interconnected nodes. Next m t^ rf d 

basically the mode, acquisition P n ^' J^^. t he training procedure adjusts the 
desired output » are presented to he net, , ^ / ^ ^ /(x) 

parameters of the network so the network lea d ^ fte ^ ; and 

After training, at recognition time a w . ^ . g b 

the output v is computed h, ^ ^^^^ "recognizes" theobjech 
indicator variable for the objec Sequent imate ; (se e Poggio and 

Both [107] and 126] use Radial B » ^ Th objec t 
Girosi 1109]), along with * ^ £™Z inputs x are taken to be the (x,y) 
domain is wire-frame "paperclip objects P riments , . rfafvely small 

coordinates of ^^^^ZL** of example views used to tram 
number of objects ,s used 5 or 10) an ^ m 
the network is in the tens (Poggm and Ede 1 ^ 

,„ another learning approach, MuraJ a»d h ar U ^ ^ ^ ^ obj 
me thod based on ^ ® "H«™«£ features, they represent object views by 
shape by looking at edge-based and point ^ ^ , 0 „ d 

the grey level appearance of the object w ^ a , 0 f 

ST :-ed S S s^e the viewing sphere and dinerent fighting 
)«. -ample views i„ this eigj^a 2* '° ^«tio„ «f 

'"to the eigenspace and J/ ™ > ™ge to recognise, ;t „ projected 

object, This technique as been tested with We k ^ ^ ^ mod * d 
t-tag views (270) per „ bject . 0 J™ £ b -< * example views (450) and 

set of example views is the primary « ^ ^temnn.ng a s m all but complete 
-ore systematic and dense sampling of th view u ^ ' hat ™» 

approach, spend much effort in co„ec "g he IxTof ^ "* " «^^« " 
«ews rcq „i red by the view-based approa "7 ""^d example 

error analysis (Breuel [23]). Poggio and Fd 1 f '^ "^ematically using 
views may be sufficient' for a neCk aPprtcT" "° 7 ' around SO-,00 

As previously mentioned, in this th™« 't 
°f the number of avaiiabie vi „s ^ s ^ f """ ^ 
avadable, our approach uses a regain , " " eWS •» ob ^t are 

" ' le ««<■ -e, however, o„,y ^ ^l^™ °< ^ere. 

1.2 Exploiting Prior Knowledge of Object Classes

object - - - 

particular problem. A lack oftam p l j 2 s e^ .T^ * * 
learning network can p roperly ge Z ^ 17 1" "* ^ ° f «■* a 

-t™, Wing just 0 „ e view ^ "» «« of view-based face recog- 

nition performance to dLpL ^^jT ™™ *« « ««ld expect 
attracted by the -mplid^rftovfeX^ " Wfi ™' P -- ^ng 

- f .t is possible to address this p roW ™ „ •„ Th e " V T ' " W ° UW to 
that learning systems mJ ^ ™" ^» the -b^ ^ 

elasses of objects, such as manufac f £ f "? ^ * 'peciffe 
insider the problem of generating a 
Given a single view of an "j* ^™^ been rotate, ahont the .axis 
ro tated virtual view o the obj ct^ay * ™ ^ ^ abou 

by ,5 degrees. To perform ^ most stralgM forward way to represent 

the 3D shape of the oh,ec , This appIoa eh works well when 

knowledge of 3D shape is with an exphc 3D m ^ ^ f 

the class of objects has a common, gene™ ^D low . banawid th teleconfer- 

ohject class like fa.es. Researchers m on^ute P ^ ^ ]n th 

encing, and V**"^™^^^ «Hh jnst one view. First the 
way to generate new views of a faee whe p ^ ^ ^ ^ ei 

sing ,e real view is textnre ""^^toi 3D mode , is projected onto the image 
to the „ew desired pose, and finafi * ro ^ ^ ^ ^ chapter pnor 

plane. This technique will be discussed 

«»*■ . ■ f ID shape can also be used to generate rotated 

Knowledge about symmetries of 3D shap ^ pogg , 0 and 

virtual views. In an early ^[^^ obje cts, knowledge of symmetry 
Vetter IHOI showed that for Mat eral.y ymm ^ The virtual view is simply 
al ,ows one to synthesis a virtual view from a^ g ^ , 

the mirror reflection of the real view. Comb.m g combinat ion of views, 

combinations leads to a surprising « * A-J ^ ^ ^ of a bilate ral.y 
tw0 views are sufficient for recogmt, n S nc v g ^ ^ ^ 

symmetric ohject essentially for free, it ^should P ^ ^ ^ for 

object using just one example view. ™££ symmetry it becomes possihle. 
recognition, when paired wit prior now J ^ ^ ^ ^ 
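
The symmetry-based virtual view described above, which is simply the mirror reflection of the real view, amounts to flipping the image left to right. A minimal sketch, assuming a grey-level image stored as a list of rows (the function name and the representation are illustrative only):

```python
def mirror_view(image):
    """Reflect a grey-level image, stored as a list of rows, about its vertical axis."""
    return [list(reversed(row)) for row in image]


# A tiny 2x3 "image" standing in for a real view of a bilaterally symmetric object.
real_view = [
    [10, 20, 30],
    [40, 50, 60],
]
virtual_view = mirror_view(real_view)
```

Mirroring twice returns the original view, so this operation yields at most one additional view per real view.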

Before applying this technique for recogn ^ ^ ^ ^ ^ _ 
theoretical and practical issues to , »».* r ^ ^ rf symmet 

equ al to the original real view - whe oh e po ^ ^ ^ so t , 

isperpendicn,artotheimageplan. Forfaee^ ^ view of the face Second^ 
views hased on symmetry we need to use dellce between views, and 

tbe linear combination of views resnR J^J^a^ Thus, difficulty m 
uofW^ft.^^"^"^ the virt u a , vie, to recognition, 
finding correspondence will limit th i W » ^ .„ . negative way: o 
z?onimce proHem more difficuit ^ ^ p _ ^ ^ ^ ^ ^ ^ 

1-2.2 An exa m ple- based approach to ^ ^ 

— ~ - ~ ing PriM 

will be represented by a se t of 2D v.ewTof t H ' T*" P "' M kn °"*°Se 
objects are representative of the vlrial C ^ f 7** ' he 
prototype object we assume that tW " 

across the object class. For each 
Prototype objects for which jnst 0 JZ ^ " " * ™~ 

View that is available for non-prototvne „b- , T Part.cnlar, let the single 

explore will then require („ + ,) views ™ „ ' t , ^^e-based approaches we 
a view of a non-prototyp , or nove ob fec,7 ""' W« ■ Gi ™ 

wi» generate a view nt ^ of the u^X^ ^ ^-b 
vJ: oe X a m p,e-base d a P proacheswi 1 lbee X p,oreL^ 

this 2D deformation onto the n ^ l I ^ ^ T1 ™ ■»* 
or warp, the novel image from the .and d " def0 ™ ati °" «° ^ 
"ionehas been explore P reTo„ s ! bv B "'^ ™e tech- 
text of an "example-b Jd» pp to T 0ggi ° Wi ' h ™ «- 

prototype coefficients. Then sy" Lie h e ? I ^ yK,<iing * *« ° f 
tahingthe linear combination i^^tC * '/T^ P °" * 
same set of coefficients. Using this annro T I P ° Se " sin 8 the 

Poggioand Vetter^OLitisp^,"' - ^ * U«I and 

Pose to a particular virtual pose. a d,rec ' "^PPing from standard 

applal^ra^ 

wi« aot be sufficient for our ^2^^ add ' ti0naI ^ ^ 
- exper-mentally. Th e ca5 e of 3D models lorn ' 



sys t e ms f0I the 30 modeling o f faces - ^^^^12 
live sensor such as the ^"~^L^o, for the system has 
Ml outside the scope of our recogn .fen-trom one-v.e ^ ^ 

p J of frontal and side views of a face (see Wallace [6]). 

1.2.3 Related work 

. , w * r Mf9l as a method for incorporating prior 
Hi^e^WOseiM^^^™ ,„ the stendara learning- 
knowledge into the learning-from-examo fe f<™*» ^ of hypothesiz ed 
tam-examplesframework.t ^ -^^^L,. i,.,*™-^ 1* . P- 
functions G, where one can h.nk of G . mch backpro pagation, 
ticular neural network architecture^ A lra ™" S P input -out. P ut examples of /, 
chooses a particular function , 6 G based on a "*o.£P P ^ 
(xi , /(Xj))? where . U ^ ^ task easier. Abu-Mostafa 

sr^:^ — • — 4 fM — tlng hrats 

into the learning procedure .^.^ ^ partitioM 

For example, one type o hint -J, * Jhe ^ ^ ^ % = 
„, the input space X That is ^ .^.^ ^ fce d ,„ ,„ 

x, x' € X. implies /(x) - J IX )• in f h< _ fact that recognition 

tan-examples approach to object image plane. In this case, the 

is invariant to translations, scales and rotation different planaI 
partitions of X are collections o image of a art cuk „ ^ ^ ^ 

transforms. How do we modify the learning k the invarianC e 

be "absorbed"? Keeping in the for leaIning fro m examples 

hint itself is represented using example. The ^metW ^ ^ 

of hints is discussed in the context of the ^dta*^ _ s ^ 
the case of a normal input-outpu pair MWg*^* ^ network to modify 
output „(x) and feeds the error (y(x) - -flx» ta* hog ^ ^ 
weights. An example of the invariance h „t is a pa, ( ^ ^ ng 

and the error fa(x) - yMf - « "«* ^2 identify different examples from the 
examples of invariance hints as long a, we can identity 



- -coded as a nenraf „ ^ t^ZT- TT 
the domain knowledge can be used „ I T Pr ° b ' em - By " ex P lai ""> 
In the area of object recognition, both hints 1 EB 7n H T ^ 
edge of invariance can be used to assist „ 1 USS h ° W prior know| - 

own work, hints is closer s Lee the T ^"^'^ ^t. Relative to our 
similar to our idea f IZv 7 T °' ™P>- ^"ggested, 

by Ahu-MustafaletrrS;; t^T^ ^ T"* « 
1.3 Application: Face recognition



As already stated, the goal of this the 

—a; 



* recognition involves matching a new input face 



against a 
database of stored faces. That is, we start ^ 

set: — ^ - -s- - * — *■ ^ as unknown - 

1.3.1 Recognition and related problems

„. „ „ r ^s^asgssff^sz 

of categorization, where the goal s to as s,g > gen t ^ ^ ^ ^ 

a VMi ety of different object classes. ^ ^ ^ ^ 

While a system that .denttfies faces in thesis we ^dress 

to^^^S^TXtSh enongh to al,ow ns to 
the identification problem. The .denfficafon p The need f or face 

explore the nse of real and virtnal v.ews ,n d and having 

detection is avoided by taking face .mages ag ,n t a nn.fo m g ^ ^ 

the face take np most of the mage see F '«- ^ ^ e earing face detection systems 

^Arelatedproblemtofacerecognitionis/nce^^ 
while not falsely rejecting valid inputs. 
1.3.2 Motivation and difficulties

of the cues 2 tZ'^Z^ IT* ^ in ' eraCti ° n ' " * * «« 
center of attention during le s 2„ S t 7 °" em0 " 0na ' ^ ™ d * " ^ 
used every day in normal Z 7T "*»g-t,o» is an important skill 

authentication ^1^1^^ * 7^ ° f US « 

rtims, wiiere cameras are common y a readv in ikp inrt c 

human-computer interaction, workstations with cameL would be ^ ble to tZ ° 
What makes face recognition difficult? First, as with generic object recognition,
the appearance of a particular face varies due to changes in pose, lighting, expression, 
and even factors like age and change in facial hair. Variation in pose and lighting 
conditions is a difficulty shared with the more standard problem of rigid object recog- 
nition, as faces are examples of 3D objects that change appearance when rotated in 
depth or lit differently. While pose and lighting changes are fairly well understood in 
the computer vision community, the nonrigidness of faces seen in expressions is only 
now being modeled, and factors like aging, make-up, and changes in facial hair are 
usually not even considered. Overall, the variability in appearance complicates the 
modeling of faces, for the recognizer either needs a face representation that is invariant 
to these factors, or it needs to model how these factors change facial appearance. 

Stemming from the "subordinate" level nature of face recognition, a second diffi- 
culty is that faces form a class of fairly similar objects - all faces consist of the same 
facial features in roughly the same geometrical configuration. As a fine discrimination 
task, face recognition may require the use of subtle differences in facial appearance 
or the configuration of features. 

1.3.3 Prior work 

Prior work in face recognition has a history in the computer vision and pattern recog- 
nition communities going back over 20 years. While many face recognition systems 
have been proposed, most follow a typical sequence in terms of building the recog- 
nition system and recognizing new inputs. In designing the recognizer, the key deci- 
sions are choosing a representation for faces and an accompanying matching metric 
for comparing faces. For a given face image $I$, let $R(I)$ be its representation and
$D(R(I_1), R(I_2))$ be the distance between images $I_1$ and $I_2$. To acquire a database of
individuals known to the system, face images are taken of each person, converted to
the face representation, and stored. Let us say that a total of $n$ model images $M_i$ are
taken, where $1 \le i \le n$ and there may be more than one model image per person.
To recognize a new input view $I$, the input face representation $R(I)$ is computed and
matched against the model representations

$i_{\min} = \arg\min_i D(R(I), R(M_i)).$

The person identified by the recognizer is the person in model image $M_{i_{\min}}$. In
addition, some systems include the notion of rejecting the input if the best match
is not good enough. This is commonly implemented by requiring the distance of
the best match $D(R(I), R(M_{i_{\min}}))$ to be below a certain threshold. In general, the
rejection threshold in classifiers is either empirically determined or estimated by using 
techniques from statistical pattern recognition. 
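
The matching and rejection steps just described can be sketched as a nearest-neighbor search. This is a minimal illustration under stated assumptions: representations are plain feature vectors, the distance D is taken to be Euclidean, and the names `distance`, `recognize`, and the people in `models` are hypothetical.

```python
import math


def distance(rep_a, rep_b):
    """Euclidean distance between two representations (an assumed choice for D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_a, rep_b)))


def recognize(input_rep, models, threshold=None):
    """Return the person whose model view is closest to the input, or None if rejected.

    models is a list of (person, representation) pairs; a person may appear
    more than once, since several model images per person are allowed.
    """
    best_person, best_rep = min(models, key=lambda m: distance(input_rep, m[1]))
    if threshold is not None and distance(input_rep, best_rep) >= threshold:
        return None  # the best match is not good enough
    return best_person


# Hypothetical database: two model views of one person, one of another.
models = [
    ("alice", [0.0, 0.0]),
    ("alice", [1.0, 0.0]),
    ("bob", [5.0, 5.0]),
]

print(recognize([0.9, 0.1], models))
print(recognize([10.0, 10.0], models, threshold=2.0))
```

The search is linear in the number of model images, which is why the number of example views stored per person directly affects matching cost.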
Ideally one wants a face recognition system to handle as much variation as possible
in terms of pose, lighting, and expression. However, in most prior work, especially
the early systems, the model views $M_i$ and inputs $I$ were restricted to frontal pose,
neutral expression, and fixed lighting conditions. This left some of the complexities
of the problem unexplored, such as the problem's 3D nature from rotations in depth
and nonrigidness from expressions. It has only been over the last few years that
these restrictions have begun to be lifted by looking at multiple view-based systems
and flexible matching procedures. Outside of the recognition problem, there have
been recent studies on analyzing faces under different lightings (Hallinan [64]) and
expressions (Essa [51], Yacoob and Davis [142], Beymer, Shashua, and Poggio [19]).

While the prototypical face recognition system deals with intensity images of 
frontal or near-frontal views, there are systems that are based on 3D range mea- 
surements and others that utilize the facial profile seen in side views of the face. 
More will be said about these systems in Chapter 2 on existing work.

In discussing the design and evaluation of existing face recognition systems, we 
focus on the following important issues. We only give an overview of the issues here;
Chapter 2 provides a more thorough presentation and a listing of references. 

1. Input representation. How are images of faces represented? Face representa-
tions, which to date have focused on viewer-based, 2D representations rather
than 3D ones, fall into two classes, a geometrical features approach and a picto-
rial approach. In the geometrical features approach, first a set of facial features
is detected, features such as the iris centers, nostrils, corners of the mouth,
outline of the chin, etc. From the feature locations, geometric measurements
are made and gathered into a feature vector. The resulting feature vectors have 
been fairly low-dimensional, usually 10- to 20-D, and the geometric features 
include measurements like point distances, angles, and curvatures. 
The second approach to face representation is pictorial in nature; the represen- 
tation primitives are fairly "close" to the original face images. The template- 
based representation actually stores pixel intensity values from subimages or 
templates around the major facial features such as the eyes, nose, and mouth.
The pixel values may be from the original intensity images or from versions of them
preprocessed by gradient or Laplacian filters. A related filter-based approach
applies filters such as the Gabor or Gaussian functions to a sparse set of im- 
age locations and represents images as a set of filter outputs. Another class 
of pictorial approach decomposes the grey level image as a linear combination 
of eigenimages which are derived from a principal component analysis of an 
ensemble of representative faces. 
Comparing the geometric and pictorial approaches, recent momentum favors
the latter. Implementing the pictorial approach is certainly simpler than the
geometrical approach, which requires locating facial features. Recent systems
for face recognition [129][20][28][103] have chosen pictorial representations over
geometrical ones. A comparative study by Brunelli and Poggio [28] favors the
template-based approach over a typical feature geometry approach when recog-
nition performance of both is compared on the same database. Indeed, the
geometric approach may be too impoverished to sufficiently discriminate faces,
especially as the database size gets large.
2. Invariance to imaging conditions. Is the recognizer designed to operate under
changes in pose, lighting, and expression? Different approaches have been taken
to handle the resulting variation in facial appearance.

• Intensity filtering. Preprocessing the image using differential operators
such as the gradient and Laplacian will introduce invariance to simple light- 
ing changes. For example, changes that can be approximated by adding a 
constant to the image, such as changing the ambient illumination, will be 
factored out by differentiating. Using a normalized correlation metric, as 
we will describe later, accomplishes the same thing. Handling more com- 
plex changes, such as the lighting direction, requires more sophisticated 
methods (for example, see Hallinan [64]). 
• 2D geometrical invariants. Transforming the image using the Fourier
transform magnitude (Akamatsu, et al. [5]) or computing autocorrelation 
features (Kurita, et al. [82]) provides a representation that is invariant to 
2D translation. The Fourier-Mellin transform, which in addition to 2D 
translation is also invariant to scale and image-plane rotation, has been 
explored by Fuchs and Haken [55] [56]. 
• 2D geometrical normalization. If a couple of facial features can be located,
such as the eyes, then the face image can be resampled using a similarity 
transform to normalize for the effects of translation, scale, and image- 
plane rotation. Since our face recognizer uses this method, more will be 
said about this in the next section. 
• Multi-view representations. As already mentioned with the view-based
approach, some face recognizers store many views per person to handle a
range of rotations on the viewing sphere.
• Elastic matching. von der Malsburg and collaborators [83][91][141] use an
elastic graph matching technique that is capable of matching model faces
with input faces even when they are separated by an out-of-plane rotation
or difference in expression.

3. Experimental issues. While the previous two issues dealt with the design of the
face recognizer, the following issues are central for its experimental evaluation.
• Recognition statistics. Face recognizers are evaluated on a set of test im-
ages, images which are usually distinct from the example views stored for
each person. These test images may contain people who are both in and
out of the database, with views of the latter ideally being rejected by the
recognizer as unknown. For the former group, the group of test images of
people in the database, the recognition rate is the fraction of those images
correctly identified by the system. Relatively high recognition rates, rates
in the mid to upper 90%, have been reported on mid to large size databases
(see Baron [11], Brunelli and Poggio [28], Cannon, et al. [35], Pentland, et

• Number of people in the face database. The more people there are in the
face database, the more difficult the discrimination task becomes. Intu-
itively, one can think of our input space for representing faces as becoming
more crowded with clusters of example views corresponding to each person.
Most prior work in face recognition has dealt with databases on the order
of tens of people. Recently, databases with hundreds and even thousands
of people have become available. Examples include the new database be-
ing collected under the Army FERET program and a database collected
by Pentland, et al. [103]. To be commercially viable for, say, security ap-
plications, it is generally agreed that face recognizers need to be proven on
these larger databases.

• Variation in image test set. As mentioned previously, face recognition is
difficult because of the variability in appearance of a single face due to
changes in pose, lighting, expression, and even changes in facial hair or
the addition of paraphernalia such as hats or glasses. Most prior work
has limited the scope of the problem by drawing both example and test
views from a frontal pose, fixed lighting, and neutral expression. As we
see in Chapter 2, more recent work is expanding the variation
seen in test sets, thus demonstrating face recognition under more general
imaging conditions.

Figure 1-3: The view-based face recognizer stores 15 example views per person.

Overall, the important experimental question currently being explored in face
recognition research is whether the high recognition rates seen in earlier work
can be sustained when the databases are expanded and the variation in the test
set is increased.
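
As a rough illustration of the intensity filtering point made in the list above, the following Python sketch (ours, not taken from the thesis) checks numerically that adding a constant to an image leaves both a finite-difference gradient and a normalized correlation score unchanged. The function names, the random array standing in for a face image, and the mean-subtracted form of normalized correlation are all assumptions made for this example; the metric actually used by the recognizer is defined later in the thesis.

```python
import numpy as np

def normalized_correlation(template, patch):
    """Correlation coefficient between two equally sized grey-level arrays.

    Assumed form: subtract each array's mean, then divide the dot product
    by the product of the norms.
    """
    t = np.asarray(template, dtype=float).ravel()
    p = np.asarray(patch, dtype=float).ravel()
    t = t - t.mean()
    p = p - p.mean()
    denom = np.linalg.norm(t) * np.linalg.norm(p)
    if denom == 0.0:
        return 0.0  # flat image: correlation is undefined, report 0
    return float(np.dot(t, p) / denom)

def x_gradient(image):
    """Horizontal finite difference, a crude stand-in for a gradient filter."""
    return np.diff(np.asarray(image, dtype=float), axis=1)

# Hypothetical data: a random array plays the role of a face image.
rng = np.random.default_rng(0)
face = rng.uniform(0.0, 255.0, size=(16, 16))
brighter = face + 40.0        # ambient change modeled as an additive constant
rescaled = 0.5 * face + 40.0  # additive constant plus a contrast change

score_offset = normalized_correlation(face, brighter)
score_gain = normalized_correlation(face, rescaled)
gradient_unchanged = np.allclose(x_gradient(face), x_gradient(brighter))
```

Under these assumptions both scores equal 1 up to rounding error and the gradient images are identical. A change in lighting direction is not covered by either mechanism, which is consistent with the remark above that such changes need more sophisticated methods.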

1.4 Our view-based, pose-invariant face recognizer 

In this thesis we explore the problem of recognizing faces under varying pose. In other 
words, the input views presented to the recognizer for identification are not limited 
in pose to a frontal view, as has been the case for most prior work in face recognition. 
Input pose is allowed to fall within a range of acceptable poses, the difficult part of 
which is to handle rotations in depth. Our goal is to demonstrate that face recognizers 
can be extended to handle a range of rotations in depth. This can be seen as part of 
the longer term goal of building a face recognizer that works under a variety of poses, 
lighting conditions, expressions, etc. 



1.4.1 View-based approach 

Our pose-invariant face recognizer will use the view-based approach for recognition. 
Rotations in depth, or the rotations about the x- and y-axes in Fig. 1-1, will be han- 
dled by sampling a number of views on the viewing sphere. The recognizer will store 
15 example views per person (Fig. 1-3), including 5 rotations about the y-axis and 
3 rotations about the x-axis. Recall from our discussion of the view-based approach 
that a 2D matching algorithm is used to match input views against these stored
example views. Our face recognizer will solve this 2D matching task by matching
eyes and nose features between the input and example views. This will geometrically
normalize the input.

The range of acceptable poses for our recognizer is determined by the sampling of
the viewing sphere in Fig. 1-3 and by the 2D geometrical normalization.
First, the sampling of the viewing sphere determines the range of rotation angles in
depth, with rotations about the y-axis within the range ±30° and rotations about the
x-axis within the range ±20°. This ensures
that both eyes will be visible, which is important since the eye locations will be used
to geometrically normalize the input. Tolerance to the remaining pose parameters,
2D translation, scale, and image-plane rotation, depends on the ability of the feature finder
to locate the eyes and nose for geometrical normalization. Our feature finder, to be
described in Chapter 3, handles image-plane rotations within ±45° and
[...]; however, only minor variations in scale are allowed, a
range of ±20% about an expected scale. The feature finder could be extended in a
straightforward way, though, to extend its operable range of scales.
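
The 2D geometrical normalization discussed above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the thesis's implementation: it uses only the two eye locations and a similarity transform (translation, scale, image-plane rotation), whereas the recognizer also uses a nose feature. The function names and the pixel coordinates are hypothetical.

```python
import numpy as np

def similarity_from_eyes(left_eye, right_eye, ref_left, ref_right):
    """Return a 2x3 matrix [A | t] mapping the input eye locations onto
    the reference eye locations, where A is a scaled rotation."""
    left_eye = np.asarray(left_eye, dtype=float)
    ref_left = np.asarray(ref_left, dtype=float)
    src = np.asarray(right_eye, dtype=float) - left_eye
    dst = np.asarray(ref_right, dtype=float) - ref_left
    scale = np.linalg.norm(dst) / np.linalg.norm(src)
    angle = np.arctan2(dst[1], dst[0]) - np.arctan2(src[1], src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    A = np.array([[c, -s], [s, c]])
    t = ref_left - A @ left_eye
    return np.hstack([A, t[:, None]])

def apply_transform(M, point):
    """Apply the 2x3 transform M to a single (x, y) point."""
    return M[:, :2] @ np.asarray(point, dtype=float) + M[:, 2]

# Hypothetical eye locations (pixels) in an input view and in an example view.
M = similarity_from_eyes((40, 60), (80, 70), (30, 40), (70, 40))
mapped_left = apply_transform(M, (40, 60))
mapped_right = apply_transform(M, (80, 70))
```

In this example the input eyes land exactly on the reference eye locations. Resampling the whole image with the same transform would then bring the input into rough alignment with the example view; that resampling step is omitted here.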

Besides the view-based approach, other approaches, namely the 3D model-based
approach, could have been taken to the problem of pose-invariant face

rend T" TT"* " ^ Cybe ™ re — r. ™ese 3D models cl oe 
oTmna d^ r y d t ed P °" f ™ computer 

' eqUiPment fM aCtiVC dePt " SenSi " g duri "« the ™*> — <>» 

• a 3D rendering step to synthesize 2D views of the face from the 3D model. 
The view-based approach, on the other hand, requires more memory to store the
example views, so the question of the 3D model-based versus view-based

rar^rjted^to^^ 6 V T^™'^™ «~ « ■ " 

-i^^-^rizs because of its simp,kit » - !f ii is a 

1. multiple example views available, and 

2. one example view available. 
Figure 1-4: Faces are represented using templates of the eyes, nose, and mouth.

examples for each person, in f „„, w w if we only have one view per person, 

approach - lots of real data ,s avadable But what,fw^ony we ^ 

use the view-based approach? In the second p remaining 14 views. 

m4 is the single available v,ew and we tr to syn hes. rf 
These synthetic views, or vrrtual « be ^"f e .ba S ed way from views of 
^discussion of the details of our face.cogn *^££ZS£.

r^ra^ 

experimental issues. 
Input representation 

r «w Precognition work that uses templates (Burt[32], 
Motivated by the success of recent face recogni ^ ^ 
*(1>(I) '

of invariance may provide immunit^^T \ . add, " ve constM '- This kind 
overall ambient hating ■JT^tZT ^ " ^ ^ * ^ 
can help factor out the slowly varying low „ . addlt, °"' th * Laphacian

that illuminates some parts ofZ "! '^^f ° f » >°«1 light source 

sue, template scale, dClt^Z. 1? ^ ^ ^ 

smaller templates keeping less detail , . ^ match '"S P ro <^> with 

evaluate the face recogni I "jj"'^'/"^ fcter computation times. We shall 
preprocessings. ^ Comb ™"<- <* template scales and image 

Invariance to imaging conditions 

XtzziTc^z^r ; pose ' lighting < and « 

mentioned, a view-LTdTlT ' d,i " g Wiati ° n b P ° Se ' As ****** 

coverdifier'ent ^^^^^^ - - * 
pose parameters, 2D translation, scale and im™. 1 J " depth - Th « remammg 

:r ~ — ^ff^jurr 

feature. Fig. l- 5 (a) shows an i„„ . ■ °.i , lmeS and one nose >°oe 

finder. When ns ng tl feaZ s to'" "f T ^ " * » **» 
say the example viL show t F t , X Th T " ™ W > 

— m that aligns the input £ 
, Ifrtt If the input and example views differ by only a 2D

the example view (F>g. l-5(c )). B the >"P form If tke inp „t and example 

transform, the affine -amphng ».l n do h s tranrf ^ _ & 

views additionally differ by a rota t«,n ■ n M ^ This 

nose features. 

After the fe atore finder locates the eyes - 3£ 
effectively searches over the rotation p— ^J^J^ input to a 

correlation. 

While the emphasis of onr face recover is 
design provide for invariance to m.nor var* .^.^ ^ , igMing con _ 

me ntioned for onr template-based repre en a^on . jmages ^ fcy usi „ g 

ditions is achieved by preprocessing ^ ^ illumiM tion levels 

normalized correlation. Bnt tl»s,s mostly fate *« ^ mi „ or 

and contract, and does not extend to ha die chang^ ^ pro _ 

variations in expression, onr 2D reg,strat procedure ^ ^ ^ ^ 

This secondary registration stage wi. 



CHAPTER 1. INTRODUCTION 




Figure 1-6: For each person, 10 test images are taken that sample random poses from
the viewing sphere.

Experimental issues 

To evaluate the use of real and virtual views in our W 

an .mage data set of 10 test views per pe on si "T"*' ™ W C ° 1,eCted 
are taken under a variety of rotation »S I J " F * ^ the test v ^ws 
order to test pose-invariant re 1 ^Z^^, ^ J? ^ "» ^ ^ » 
which is larger or comparable to most fa- 6 H ^ " ^ database > 

though one recent database ^Z^Z^T^ * ° f ^ 

8>ve a quick preview of recognition ate S 1 ^ ' * * ^ T ° 

best case scenario for virtual views is 85% ^ ^ ^ * ^ 98% ' «»e 

1.5 Contributions 



1.5.1 Main contributions



facial features with teliatelTl ge ° metncal ""Nation using a few 

view ing sphere h t t t"i:r tem t tes from mu,tipk ^ - 

systematic study of pose-invariant f, v,CT *. are available per person. This 
state of the art in face re ~ ^ '° ^ ^ "» 

with frontal or near-frontal riews ' ^ '° °" S "»* h * d dealt 
« available, but prior knowledge , s a^la >e ^ demonstrate

generating virtnal views allows «»«££*J> -d linea, classes, to 
the application of two techmque , P"»? {mm jmt one exam ple view, 

the problem of generating vrrtual v.e y ^ obtains a 

Usin g the combined set of one real « I- fp ^ ^ 

higher recognition rate in our r^^S^- ^ generalization 

£^^«^^^<*~ 

space populated by the virtual examples. 

1.5.2 Secondary contributions

■ W of the thesis are motivated by the feature correspondence 

. P _ Ttt o : tit: 

run prior to the face ^Tl^X^ templates of the eyes and 
As described in Chapter 3, .t uses a larg WMe our face re cogm- 

„ose region to thieve person- and m of input and 

f.on system uses the feature »n for g^me 8 lions that pro- 

example views, the feature finder wdl be use! ^ ^ to .^.^ 

cess images effaces. For msta nee h * fte first frame . This 

of two different people. For example, ires on the „ rd er 

of tens of feature correspondences rather a 
feature finder. Onr face "vec or.ze r , descr ^ ^ ^ tQ ^ 
dua , representation 0 ^^iing of intensity values. The shape component 

S£J^^w-^« ----- — 

1.6 Roadmap

<* that use 3D models for face syXT c " /'T^ " ^ meth " 
Prototype, 3D models „e an aHernati v 7»l , * W " h """S 2D in »i- <* a 

The next two chapter,, 3 and 4 d "w" """^ ° f ^ 
results with real views. Chapter 3 oca 2 on h 7 Tf Md ^ e *P e ™ental 
'he two eyes and a nose feature (ZZTZ ft ""^ " hich io "'« 

Chapter 5 introduces a Vectori^ ; " ^ f - ™°S"i~. 

generating virtual views. A ^^^'7^'""' Wi " be 
vectorized representation for face win b , ** Mta »«»^ computing the 

Chapter 7 0 „ virtual vi ' f ^'^ed next in Chapter 6. 
f-s using the ^7^," ™ ^ ™ M <* 
-ing these virtual -iewstafc ^ T I r^' 

Chapter 8 closes the thesis Z ^^T^T™^'^ 
Appendix A discusses the face dlt, I * W ° rk Md "elusions. 

Appendix B descrihes the J^Zl^Z T ^ 
thes, 2 ,„g virtual views. f ' he llnear class approach for sy„- 



Chapter 2 
Previous Work 



vision and computer graphics ^ faces are processed to perform taste 

vision and pattern ^^T^^T^ «»*— ^ * 

that of rendering realist, '-f °™ ™ relevant for this thesis. Since the - 
Both the analysis and <^»*'™ d thls Aapter explores face recogmt.on 
thesis topic is face recognit.on, the fi s part ol ^ rf ^ thesls co „. 

"nroredetail than proved in "^fL, so this chapter also discnsses 

2.1 Face Recognition

« „f face recognition and gave an overv.ew 
of the major issues facing ex,st,ng face recog imental issue, 1» tta 

re presentation, invariance to ^ 6 »* ™g a more extensive list of references. 

^ be listed at the end of this secrion. 
2.1.1 Input representation

in the space used to represent faces As correla f l0 n, the mam factor that 
proaches to i nput represent a tin,

representation. """" P ' ct ° nal approach that uses an image-based 

Woog, Law, and Tsang^J, Brunei JTdjlS^" ^ ^""M. 
faWb^d systems begin by locnting £® ^ " ^ H "»«W These 
a tbe corners of the eyes an(J mo ^ *»*™, mclnding snch features 

a'ong the chin, etc. These featnres 1 ^ m "7" th. conto„ r 

Procedures that are cued on edges, h oZZt, r , USmg ^''^ h ™°'- 
;" h l ^.'r*. «J Oefcmabl temp ( « ™ I^'-- <>' the gradient 
The spaUl confignration of facial f eat „r« . ' and Cohen[1441) 

pensions typically i„ clu<le ^ ^ ^ a featnre vector whit 

f°r the systems listed above, the iiZt lt TZ' ^ CU ™' U - 
™nnd ,0 to 50. Craw and Cameron S ' V n "*» «*• 

represents featnre geometry by disnlacem , ' ^ " SeS S n ° veI fe atnre vector 

of featnres, thns representing ^ Zn ^ "~ 

represented by featnre vectors, the ^ " ^ "* Once faces are 

Enchdean distance or a weighted norm „Ter e d I " M * Ured * 'he 

some measure of variance. To identify an Ih ^T^'"™ osoally weighted by 
ehoose the mode, closest to the i„p ut ^ale Tnl ^ recognij 
d-nss more in sectioil m ^J^^T ^™ ^ace. So far, as we will 
frontal v,e„s, a, the the geometrical ™" ntem j T °" 4 » »™ted to 
■nvanant to face rotations ontside of the"Z e " ^ ^ " "* 

Prctonal representation, faces are repre „te n" ^ S ^™. the simplest 

« by snhimages of the mnjor f^l^^ "* h ™^ s <* *e whole face 
Template, mages need no. be taken from , h To " f Y^' Md ™n.h. 

-e g ra d,ent magnitnde or gradient ve^ emZ"d T ^ ^ «- 
A" mpnt face is then recognised bv / '° ge ' invariM « to lightW 

Really nsing C o rre , atioil ^^^^ «« ™* 
corre,at,o„ on grey level templates. His system M t'°il ^ "~ 
Wd a pp r0 ach of Brnnelli and Poggi o( 2 1 T h Z * '""^ ' he «™Pl— 

Sradrent magnitnde. Gilbert and S£ U ~a,i Z ed correlation on the 
of the Br„nel,i.Po gg i osystem thal fe f ^ ^ " <*> "™ hardware implementation 
presents and mntches temp, a tes nsing a hi« Bnrt[32J 
B.ehsel s templates,, nses the , and /com" -nctnre, J 
and reconstructing face .mages- It can be of pixels ,„ th

reducing tne dimensionality o ^ ^ ^ (» ^ ^ U £ 
templates to the number ^'6^'™^ J,^, one mn S t assume that he 
nsed in the represents To apply pnncma V ^ spaMimg set 0 

se t of all face images is a hnear "^^^ is fou „d by applying prmopal 
eigenfaces, called "face space by Turk and nted by their project™ 

components to an ensemble of face .mages • ^ " > , mcipal co m P onents to 
oZface space. ^ ^ ^ ^^^L to to^^ «- B-jj- 
face recognition. Akamatsn, e< J. 5 first prep C ameron[461 apphed 

transform magnitnde to * ™ £ that haV e been warped to move 

principal components to "shape-fe faces, Brone m[n7] used prmcpal 

components on templates of he ma or c a, ,, onal cost . Pentland, 

comnarable to correlation but at a f act.on f blems: recogmt.on 

Saaaam, ana Starner recognition under varying 
on frontal views in a large databa e ot o ^templates" . Knhy and 

lo images of faces, generating a new u - ming autoco „elat,on on 

image space. Kurita, Otsu and ^ of „ p to 2 „d order are 

the original grey level .mages. 25 antocor tnIou gh a tradit.omdL.n ar 

nsed, and the subsequent 25D rep e-t « p ^ s , 

Discriminant Analysis claas.fier. Cheng ^ 1 I ^ ^ co , umns 0 the 

Vafne Decomposition (SVD) to ^ et use SVD to define a to» 

ima ge are actuaUy interpreted as a m r . Ch g, ^ ^ , 

set of images for each person, winch , . ^ for f e s by 

person has their own space Hong creates ^ analysis . Ramsay, 
unning the singular vafnes from .te Ul^ i«™ 

e, nl.[1131have used vector onant » on to p ^ rf 0 

into its major facial features, th to u rep here .„ h to choose 

best matching templates from . codehooL T P ^ ^ [m] h use d 

the codebook of feature templa tea Nak - ^ of th fa ce , 

•isodensity maps" to repent toa The g ^ fa .^.^ 

. divided up into eight buckets defb ng ^ ey ^ Md face matc „,„g 
the image. Faces are represented by a set o 
ispe r fo rme d U si ngc0rre]ationonthesebinaryim
^onnectionist approaches tnf„ 

Weng, Ahuja. and Huaw'mi F, ■ .J 5 ?'' EdeIraM . Refold, and Yeshur-n'491 
*"^e networks J^tt™™ 

Peaches are similar to the ones described To ^ th ™ •>- 

-m ma ti„ g „ odes , inputssuc]] asgr XT im I 6 ' m f'>^' of simpfe 

object reported by the network. BypresT^ill ^ PW ° bjeCt ' d *™mes the 
the network i s trained , lsing a "lea™D ' ! T" g "* ° f m ° dei ! ^ ™ages 
Among connectionisf app otl t 2 >' ^ ~ P *™- 

-snes are input representationa, the „ p u Tv ! "d ,7 ' W ° m0St ™^tant 
ore. As previously mentioned the ill * ™ ° Vera " architec- 

™ P s, flWj uses a threshold bLr fLl^ZT ^'^i 

'he grey level image. A variety of neJorf lit ' ^""^ »*• applied to 

mnltdayer network trained by UpCtSv^h™ ^ A 

been explored by [53] and [94J. I„ . , ,ZT W™"*. has 

"°" o-g gradient descent 7 3 ^ ' N ' ra, '" S * ^ f ™<> 

memory that Wis" the pattern in me! , 11 * r " r " rrei " autoassociative 

a multilayer "Cresceptron" that IT 7 c '° "* (135] use, 

-"f d -~r:7:Er;r:r - 

Lades, „ a ,. (8 3] and Mailjunathi Chellap^ 7 !' 1" ^ ^PProach, 
a-lasticgraphs of local taxt^'fej^l? ^"^^H ~t faces 
f.^' " hkb st °- the distance between th! tt „t7 ,7 f ^'"^'"'begraph's 
■on ,s represented at graph verlices by ^ -tores. Pi c , oria] infom]a . 

° the ,mage a t fe atu re ,oc a tio„s. dJ^ II T ' Gab ° r ^ 
de formed to match the model graphs Mat ! f '"^ *"* ^ is firet 

? "0 geometrical deformation and the I u J*** by ^ming 

measures 

described in terms of flexible graphs th Z , ^ fiUer res PO"*es. While 

matching flexible templates. ' ' * «" be -ad as representing and 
conditions and experimental issues such as recognition rates.

2.1.2 Invariance to imaging conditions 




mak es face recognit.on a most systems do provide some flex.b.hty 

^^representations 

mteringthefaee image withabandpas « rlta P ^ tQ , lghtmg 

invariance to lighting cond.t.ons. lighting effects while still preserving 

is fawpass, bandpass filtermg should remo « the l g^ g^ ^ ^ lighting 

the higher frequency texture mforma » m h ^ ^ M fte face , 

effects are lowpass breaks down from the periphery. 

whi cb nsually happens when the face * ^ the Founer 

To provide shift invanance, hand]es th e translational pose 

transform magnitude or autocorrelat.ou Ita y s ^ ^ pararae . 

parameters, retiring other -echaursms to ha * for invarianC e 

ters. Using a ^ I image-plane rotation and scale 

(e g. ATR), Fuchs and Haken[55l[561 actor h take the Four.er 

parameters in addition to the ^nsUtmn^ parame^ ^s , ^ ^ 

transform magnitude, wh *h P-des ^ R ™ map> a new rep resenta- 

re presentatio„ is transformed '" to a °™^ e lm age become trar,slat,onal 

Z where scale and ^"-^ Lansform magnitude again then 
parameters in the new ^ e rot ation. Using invariant representa- 
provides invariance to scale and ^ 01lt of th e image plane, on 

tions to handle the remammg pose P«"*£ been tried, 
complex 3D textured objec s hhe fa es has ^ . ^ 

By finding at least two facal features -usually \ aa ^ ae rotation. In feature 
face can be normalized for transfat.on s~le nd gj for sca)e „ y 

geometry approves, d.stances m the u ^ 
dividing by a given distance such as the mteroc fcy a „ d 

,„ template-based systems, faces are ^ ^ approaches , of 

scaling the input image to place the eye at fix ^ nMmalizatlon ste p 
s'pw:?;t::j;:ct er that ^ ^ — — - « ^ vie „ ine

<"g with the Variant representations and " , g ChMgeS ''" " ose ^ 'W*- 
current systems treat face ^Z^™ ^ ^"^» bribed above, 
ceptions, however, as some system hav^m , a ' There ™ « 

»g Regies to dea, with some ^ ™ 't~ d «-ib,e match- 
A^IS], fonr slight,, , otated ^dd ^1" f 

add, "°" <° a fr onta, view. In Otsn[82] ,1 „'„ ^ ■» »««' » 

are used ,„ building a Linear Discrim L„ t Anl , ? £ 5 ° ™ages/perso„ 

are extracted from a videotaped session with 7 ^ TheSe trai ™« ™*« 
head rotations. Flexible graph matching , T ^ aW 
•o enab,e matching one Z H e t t'TZ j' ^ T ^ ((91 )' ^ 

expresses. What distinguishes mvappl f f a " d views facial 

allowed variation in viewpoint and theTe o ™ ™" ^ a wider 

"virtual" model images. * ° f "'"^ ^ >™wledge to generate 

2.1.3 Experimental issues 

in «"> -perimental evaluation ™ "*° d -«o„, the important 

-te, the number of people in the dataW IZh' -T'™ 8 "* 
th,s section we explore these issues in mo7e deL, Var,a "°" " ^ 

approaches, the modeling images Je usel l tr * 1 »™"™st 

After models of the faces « S o c I« ' " * T^"^ 
™nmg the system „„ a „ i mag e test se^Th ^ '°" are Spiled by 

on whether the system include the o&n o f Z 7 " ° f 
to the database. ,0 " ° f rejecl,n « >"P"ts that are a poor match 

1. No rejection ability. \ n the simnl. 

- tested ^ on i^eTof pL^tt trt" °* «« 
-or here is a sn to(; ,,„ ^ ™* ' The P*-** for 
Figure 2-1: [...] two test sets are listed above. (Diagram labels: "test set: people in database", "test set: imposters", "test image", "correct", "reject", "true reject".)

of inputs correctly recognized, and the substitution rate, the fraction of inputs
falsely identified by the system.
, Poo r m ^ r+UL H a nation ^^^^ Z 

shown in Fig. 2-1, '^^"Ji l groop of impostors. For the test 
testing data, a gronp of database people and » ^ J rejectio „ iate , 
group o, database people, we system . Thus , the 

tace mistakenly recognized as faces from the datable. 

One can tradeolf the region ^^^^^ ^ 
rejection criterion is. Using a st„c« on nUn« ^ ^ ^ ^ 
.ejections, as inpnts now need to b a to er m substi tntion 
Conseqnently, the false reject.on rate w,H m££* 8 ^ ^ ^ 
rat e, and false access rates are ^ , good . However , 

rejection criterion more stnet, the false access ^ ^ 
images of the person to identify tie Z ^""^ !" Syslems «» g™b several
^tern has sevL attempt ^ „ ™ r t T «» 
application sueh as humaLomnu^ t , " and ' f ° r a Iower Verity 

cesses. Further, the userlTno ™ « tit ' ""7^ ™' h felse ~ 

liberal rejection criterion should be used.
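
The recognition statistics discussed in this section can be made concrete with a small Python sketch. It is our own illustration, assuming the definitions as we read them from the text: over test images of people in the database, the recognition rate, substitution rate, and false rejection rate are the fractions correctly identified, identified as the wrong person, and rejected; over test images of impostors, the false access rate is the fraction mistakenly accepted as someone in the database. The function name, the data layout, and the sample results are hypothetical.

```python
def recognition_statistics(database_results, impostor_results):
    """Compute recognition statistics for a recognizer with rejection.

    database_results: list of (true_id, reported_id) pairs for test images
        of people in the database; reported_id is None for a rejection.
    impostor_results: list of reported_id values for test images of people
        outside the database; None means the image was correctly rejected.
    """
    n_db = len(database_results)
    correct = sum(1 for true_id, rep in database_results if rep == true_id)
    rejected = sum(1 for _, rep in database_results if rep is None)
    substituted = n_db - correct - rejected

    n_imp = len(impostor_results)
    accepted = sum(1 for rep in impostor_results if rep is not None)

    def rate(count, total):
        return count / total if total else 0.0

    return {
        "recognition_rate": rate(correct, n_db),
        "substitution_rate": rate(substituted, n_db),
        "false_rejection_rate": rate(rejected, n_db),
        "false_access_rate": rate(accepted, n_imp),
    }

# Hypothetical outcomes for four database test images and four impostors.
stats = recognition_statistics(
    [("ann", "ann"), ("ann", "bob"), ("bob", None), ("bob", "bob")],
    [None, "ann", None, None],
)
```

With a stricter rejection criterion, some of the accepted entries in both lists would become rejections, so the false rejection rate would rise while the substitution and false access rates would fall; that is the tradeoff described above.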

^ , X3£Sl£2r , f - — — * 

a database of 42 people To test „ ' mPreSS ' Ve 1W% Titian rate on 

•08 faces from onlide the d LTwTt rf H °" ™ ^ « 

»J Poggio's template-b 1 ? ^ ° f ° % ' ^ 

views of 47 people The svstem of r "cognition rate of 100% on frontal 

People and Lhed 1 rl ^Zn T 'f ""^ " ' 
ignition rate .hen theirs",,^ Jl ^ ^i"* * » 

under varying lighting condiLns. A wZ^j] r^Tl WW "y^'k ^ 
smaller size database of nr.lv 11 it, % reco S ni ^on, but on a 

report a recognitor 0 \% ^IIC f "T^' ^ 
Bichse,[20J give plots of recogmln r at .^T^T ^"/T^ ° tS "' 82 ) 

rate also climbs to unacceptable levels. ' lse access 

do not necessarily scale ^J^^ZT ' ^ °" 
m face recognition have been iJL tft ' ^ stu<te 
faces have been nsed. WhaTtWeTsno ° f tha " te " m <*' 

database, some of the J ^"^^M M iST l T ^ 
on the order of 70 people or more P™,l jT, ? , J ' ' " ' " Sed databases 

database is being collected under the Army F ^™^ sof — 3,000 people. A face 
of Angnst, ,994, the database had oL Z ^-ZT °" V M 
If we take a cue from the potential app icaZ ^ ^ PCT PerSM ' 

database size, databases on the o der rf 1 0's a f ™ g " ition considering 
building access), bnt other applies fa ' , T " T (e * aU '° mated 
1-ger databases on to o^Z^£^*^^ ^ 

t bere:s:,r; s ttm^:r f issu v s the variabi,ity ° f th * ~ - «■« 

recognition system, exp" ^ J 1 .T" ^ * to 

P ments should jdeally use a vanety of test images per person 
samphng Sm aU cWes in Pose ^^^Z^Zl^
varied in this regard. [82] use, 50 tes mg ™ P ~l wn , left , or righ , The 
session where the subject ^was asked o ro We h, h P ^ 
database of vou der Malsburg aud «*^ I ^L, Peutlaud, Moghaddam, 
rotated 15" to the right and v.ews w,th vaxymg expre ^ 

'person covering different rotation angles on the v.ew.ng sphere. 

2.1.4 Related work

Whi,e our discussion of existing work has ^^^^Z 

"X Wu and Huang[140], 

S^S^JS°SSS5* - and Mi.ios,85,, Nagamine, Demura, and 

Masuda[99,, Gordon(61]). background, making 

,„ the profile view work, the face » ^ J*"™^ a low dimensional 
detection of the profile simple. Next t e profile ^epes^ y 
(1M »D) vector of features extracted ta-fl- ^ ^ ^ 

are autoeorrelatmn coefficent o Jta b nan, locateQ fiducial 

a scale space analysis f , segment the profile^ the face 

I„ the 3D range work, a depth map ° f the ,a * k feature uti , iz e u by 

with an active depth sense, Paee s « ™;^J„ re J ctor , the components 
th ethod, [85, cover, ^J^g^ the profile, .n [86,, the 
of which measure distances between matc hing convex feature regions 

tended Gaussian image is used for ^ge data along curves 

of the face. [99, ^^^S - we,, as circles, are 
of i„tersect,on w.th the 3D data. Hoi z OTrva toi. features 
2.2 Synthesizing faces

problem of pose-invariant fJZ^Z I , ^ °' ' hiS ■"»— «« 
available. This problem is redu'dTo 1 I Z * °" ™ W ° f ^ " ! ™ * 
. tua) views of eaeh person That is the 7 I , hy s W th <*™S ™- 

views will be used as exlple v ews „ TT"L °' ^ ""' M<1 ™ ,tipfe ™.ual 
views are synthesized nsingpro Zl ^ f 7 ^ ^ ^ ^ 
can one represent this y£Tj£^tT*» * ^ <*« ^ 

h^lt^ 

^^T^^t^ "■ W ^ taa ^ - - to 

in the co mp „ter graphk a 7d cZ ^ eXPreSSi0 "' ™ S hs be » 
h M dwidth P teleco g „fet in" (Zl 7Z^ * — 

Chen, and „sn N , Ahimotl's" a Wal 1,"^ ^ *"* 
^',H01J,Waters add Terzopoulosfm [1 M]^^ ^■"1f , ' pep ' i - 0fa ''-- 
Harashima, and Takebe[40] aid Wi ml II 4lV 7£ ""if™' and Sai »W, Choi, 
shape of the face is reprLiw ZT f " 3D m ° de ' ing teC " ni< "-. ^ 
multilayer me sh that siuTatt tZ Afte ^ "T" " * * ^ C ° mplkated 
mapped onto the 3D m ode el J o f t "e IT" / / iS ^ 

generated by rotating the 3D mode, an *~ «"» 
mapped onto the 3D model in one of t, J " ng '° \ 2D lma « e - Fac « are texture 
facial features in both the L " Tnd 3D 7,' ^ «™P™<tt* 

color image data simultan eoT fb^i g ££££ T""? ^ 3 ° ^ »" 
from Cyberware. 8 Spec,aJlzed cqmpment such as the digitizer 

^;,~:thr zzt mMto r ,y *« ° f «- 

method or a It r g facil e™ ^ ' *° ^ ^ expression as well. One 

hanging expression deforms the3Dm l odi' mimJ^ZZTT, *" 
a mnltilayer tissne model, translate exnr J, , T ' ^ 3D with 

simulated in the tissue and deforn th T """^ mOT< ™nts, which are 

s,stemso fWM d [4 o,rd d t h: f rm f \?r-* i - d » g 

•by moving vertices in ways that mimic or Cf^l 1^7 ' ^ ^ ^ 
*• ^rvr^rWs are designed with teleconferencing or

the ac on umts ed , ^ ^ ^ or ,.„ p 

different expressions by hand tweaking the 3D model. 

2.3 This thesis and prior work

of the ,mage plane. For each person the viewing sphere. The 

using a view-based approach that sampled plat ^ base d, thus building 
"."■^^nS^to^^E frontal views [U,[28l. 
^onTfortT^ 

synthesis uses simple 2D warping operators. 
Part I

Face Recognition Using Real 
Views 
Chapter 3

Feature detection and pose 
estimation 

independent featnre finding and pose es mat.cn ^modn ^ ^ 

duction, the kind of ^ ICarensed to bringinput face, 

at least one nose featnre. The locat.ons X ample views. Pose estimation 

into rough geometrical alignment w,th the databas ex P ^ ^ 

is „ S ed as a filter on the database v,ews, peering only ^ rf ^ 

similar to the input's pose. By pose est.mat.on we reahy ^ ^ 

rotation angles ont of the image plane s.nce f a ur » s is rea]ly 

existing work in detecting facial features. 

3.1 Previous work

featnres snch as the eyes, nose, month, and face on ^ rf ^ 

addressed the issue of ^"[^ZZZ efforts are motivated by the 
model fit to the featnre. White most featnre detec ri applicat ions 
need to gecmetrically normahze a face ,mage pr » to g 

of facial features include face tracteng etectmg f ^ parameterized 

Most research to date has taken oneof ^ « . nterest operators . In 
model approach, a pictor.al approach and he n^ o U ^ ^ 

one parameterized model approach ^ Ms ^1TcZn,\ (Ynilte, Hallinan, and 
features are fit to the image by mimm.zmg an energy funct.onal ( 
Cohen[144], Ha]Iinan[63j, Shackleton and WelshfllRl H Jm , 

deformable models are hand construct^ Wdsh(118) ' Huan « md CheD 69]). These 
features such as the iris oH li Z T ^^'''^ ««• that outline sub- 
of the mode, to ^^ itZn ^ " ^ 

fitting is performed by min fa 1™ ^ 1 TT a^,' VaikyS ' " " d m ° del 
fits a g ,oha, head mode] ZZ t d £ I of f fl « 
Craw[15], Craw, Took, and Be„nZ 7 Colt ° f , fe f (Bennett and 
ing individua, feature ,^^1^' Waf 7 Th " ^ 
con our mode! of snakes to track faeia, features in ^ "* 

a^i^ 

(Bfehse I[2 0,, Baron[n ' ^X^^^ * 

composition following the ehrenW „ , BrUnell, ' 28n - a " "e'gentemplate" de- 

the weights ofhidden ayer oI™r„tt°" TT" " [1 ° 3 "' " 

For the template-based systemT or " t (V '" Ce "'' WUto " d *>»[1M|). 

the typical matching metr The ! ,°" PreprMessed «*- of the image" 

- space" metri * mel e Xl 7> ^ ™ * 

and its projection onto tl ZrtZTJ n ™ * SMm ^ ™^ 

*™» a network where - 

negative examples. opiates are learned from positive and 

base'dtLir;:";^ 

P-ches, this appro aLoet, 2 f£ ^ >l 

-se or mouth detector does t ad he W ^ » ^ 

structure of the image, such as dmer, ,A t " ^ * fe vel 
and Ye S huru„[n5]), o thtUmh • "; yeJm '' " ^ ' 81) ' s ^™try (Reisfdd 
lappa, and von de Ma sburg^ tt h I Zl 

of the image. *' ^ eXtraCt<id f ™> a wavelet decomposition 

3.2 Overview of our method 

in and out of the image p ane T " a " g ' eS ' Whkh indudes Nations both 

issue. As Just mJ££££2Z^ '° ^i 1 "; ^ addresses this 

content (i.e. the eyes or nose i L^t U 8 ^ k*"" 65 

to fall into one o/two c^T^S 7' * T T' '» d 
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*. fir,, tune the models. The amount of work is 
ad hoc and retire "P- 5 ™^^'^ a, model, and fitting rn.es for 
m a„ageable for one view W ™^™ Sloped. Thns, we chose to explore a 
different views on the vewmg P e - d P^ fa ^ 
template-based approach for onr featnre P ^ 

To serve as the front end of a pose mdepen -"ecog^ ^ ^ ^ 
m „st, of conrse, handle varymgpose an ^ erof ^^^^enfrommnltiple 

poses and from different people. To ^>"° ^ , from di ff ere nt scales 

from different views on the vewmg sphere are a « . P ^ ^ 

a„ d image-planerotations - grated by sm s ^ ^ ^ 

operations. To make the featnre finder person P ^ ^ ^ 

=^rzsrr£=== 

correlation. , ber of templates 

Onr featnre finder, then enta,ls ^ we use a 

samp .ing different poses an exem ar , « ation of th e image, 

hierarchical coarse-to-fine strategy « ^eve py ^ ^ ^ 4 refers 

,„ what follows, level 0 refers to the ongmal rmag ^ 

to the coarsest level. The search begms by gene at ^ ^ 

, CT d 4, where the pose parameters are very coarse y P ft ^ 
is nsed. Exploring a level 4 >«»^3^ the pose parameters are 

finer pyramid levels. ^ P-essmg ^-ds to h ^ ^ ^ ^ my 

where the correlation scores are not entirely ■ ^ ^ 

switches to depth first . Search a levds 4 « ^ ^ ^ ^ , y 
hypotheses are generated from all level JV q{ ^ 3 nypot fi eses . 

sc0 re. Then the search « ^c^ « a de tfrff 

threshold tests and the final answer is circled. 
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hierarchical coarse-to-fine 



approach 




subset of templates 



search in pose space 
- expand hypothesis 
and test 



full set of templates 



verification 
- register and test 



survives 
correlation tests 

Wn g generated and L sort d Th sorted ^ ff** 3 

depth first. The first level 0 hypothe i t ' * « ««, expanded 

returned h y the system. T^^^^T™,^ ^ - 

a.Ly 0 ;rhi^ itr^wXs: 
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3.3 Hierarchical processing 



Zl - foUows. At the coarsest level >ve 4, jta • ^ tes aie 

o the overall potion of the face, so a U £ o ^ ^ ^ , eve „ ^ 
correlated over the entire ,mage. 4 pixels - the pose parameters can 

coarse - the interocolar distance ,s ^ Cuirently the system oses 

be sampled very ^' ai .™£ 30) three image-plane rotat.ons (-30, 0 

3 and 2. . , at levd 3 or 2, pose space is explored 

When a pose hypothesis is bemg refined at ^ rf 
at a higher resolntion in a small ne.gh borh-d -o ^ ^ 

L previons level. At level 3, for ms Unc the 5 lef ^ ^ to ^ „ umber 
expanded to inclode 3 op/down — ^ J image . plan e rotation paramete 

^—s. ^^S^TE-. estimate of the 
L»re is explored in a small neighbor n se archmg over 6 

pru. - * 4 t r oth :;:t;:: x r;; client ^ 

"p/down rotations, 3 "riation maxima. Pose hypotheses from 

6) ip a neighborhood aroun th< ^ « ^ ^ ^ image at that pose 
1 vels 3 throngh 0 keep track of how P ^ , f ^ temp ate 

For each of these leve, 3 % -lotion of image-plane 

correlation is above a certam threshold- ^ J ' a toU , of 15 rotations rom - 

the two irises and a nose lobe. 2 ^ done t0 increa se the 

The repetition of the np/down ^«* t ° 0 ^ a cno ice on the np/down 
flexlihty of the search - it » T t 
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pig " re3 --^- plateoftheeyesandnosemedbythefeah 
""^^^^'^^^ 

cation steps. Level 2 hypotheses P ° 0 vi/eXZ]' J ' ^ °" 

and the finer ,e vels use lhe -lat.velygoool esti m ates of feature l„ calions 

™-::l~;t:r;ht 

3.4 Template matching

^-*r?j^ri^-— • , 

toted to square regions Ar.,»1 . , g % ^""^ Certain features not K • 

°! » Ml face terop! « L ™ H TV" ge ' a ^ -tirnate 

the eyebrows to below the chin At fi f d ' lem P'ates that ran f rom ahnv 

^tes that cover smal^ ^ ^ *— — 
rt J' a° bMgS vs ' -«> fc » 6 s in the 1, ^P' 6 '^P 1 ^ At 

F'g- 3-2 handles cases where where baL j " d temp,ate f ™n the left in 

do not come down lo the eyebrows At level th Where "* 

"-d, bnt the template is broken „ p fn to two 1 T ^ *< ^ 3 afe 
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M onnd the iris center or nose lota featn- ^ th blem - 

The correlation ^"'"^I'Xeps Jack of the different exemplars. For 

» particnlar exemplar eye or nose feature h ^ ^ For a pose 

that all subtemplates of the eyes ^J-J^ be some com binatio„ of pass.ng 
hypothesis to pass the threshold,^ test^the ^ same 

e and nose templates; the P-^^, exera p tes increases the flexib.hty 

only exemplar B will still be allowed- on processed 



$$ r = \frac{\langle TI \rangle - \langle T \rangle \langle I \rangle}{\sigma(T)\,\sigma(I)} $$



1 ; • ^TI> 

i, the normal correlation of T and I <> s ^ ^ ^ system some 
st andard deviation. We hope V, range 0 f the camera, » the ,mage 

tavariance to lighting cond.fons an fte dyn ;> ^ t on 

me an and standard dev.atmn are » ™ ;„ to provi de for some mvar.ance 

preprocessed versions of the , mage and templars » rf the gradient , the 

lo hghting. While we have expiore he * ™ V ^ ^ stooJ out the 

Laplacian, and the origmal grey preproce ssings and then snmmmg 

v^xsz - — - - - *• 

^higher resolntions in the ^^^^^ 
This might foil the matching proces b ca ^ ^ ^ h 
cisely match the templates dne to d.fae e features> t he 

feajres in the inpnt may not snffic, n«y <ta t » ^ template mod , 

•,„pnt featnres may he froma nove ^ correspond ence w,th he 

viLs. In order to bring the based on optical flow to "warp" th 

templates, we apply an ^° mplate , First, optical flow „ measnred 
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template 

:e KH 

m 



input 1 ) ^pute 

pixelwise template 
{ correspondence! 



3) warped input and 
template in 

2) 2D warp input COrres P°"dence 



warped input 




feature, as shown in Fie <M ; 0 tk 

f ^^»P-t^u„^I'^^ i ™^ u^g the flow fie.d t „ 
or snoall rolationaI and ident i ty . re S d ' ^f 46 - This W Ps '° «»P««te 
'^Plates. Correction is perfor ^'X * ™«s between the input fealures ^ 
R*al feat™ locations are determined t ^ 
depth first search. rJ^J™'""*' 'eve, 0 m atch returned by 
which are ma„„a„ y loC ated i„ tfi^X ^ ° f % iri « 'he nose ,„bes 
" W .nput image using ^ P 1 ^ ^ corresponding points 

features located in see exa mple test imi Z '™ 0P f " Cal Fi «' " 'hows the 
correspondence ftom optical flow is den^ Je c!n Id "TT"' *° ^ be ™* 
featee points once we have brought „„" eye „d ' ^ m « < h ™ 'hree 

wth he oarage; al, we have to do is man Tal v S n , " ^ ! "'° ^spondence 
templates. We stop at three points belo e ha fl 7" ^"^ " the ^nrplar 
- — » - b y the geonnetrica, ^ ^ L^"* °* 
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test images. 

particular test run, let <U. "> f ^ UK finde r outcomes were 

Us manually chosen locat.cn. Four ddt ^ bad {i J , ,..,) 

j /j ^ t onri), marginal (t 90 od b . . w hose t d to be aDoui 

^ „ trno fett found; al, hypotheses re ; ec e ^ ^ ^ ^ ^ 

^ of the database, t h ^y ^ ^ w outcome n 0.4%^ 

margi „a, °— ^ % of th e good and ^,3 pixels, 

margmaloutcomesaresufficen maJ0rity of the test ,ma 6 es. 

sotherecogmzercanberuno 
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In most of the error case, th.r 

3.5 Summary

I" this chapter, we presented a pose anH 
The next step in our view-based f a 



Chapter 4 

Face recognition using multiple 
views 

been qu Uesoccessfol o„ fronta v,ews to e ^ ^ [59 , . 

being a good example (Baronll 1, f "" 6 W(J systems to handle vary.ng pose 

,„ this chapter, onr goal is to "t^tob representingfa£e s w,th 

notably facial rotations ,n depth. Our PP o ^ [5 v]evls 

templates from many example .mages t h at » ver and 

pers on shown in Fig. 1-3- I" ch p^e we j« yiews peI ori . 

- fM ta recog " iti0 " wi " ta disc " " 

"^ontlineofonv^^ 

=Co:tTeir:! T d = 

As shown with the thick grey arrows, the on fine p 

the geometriea, registration and ^^K*^ 
To recognize the person in ^f'^J^^ finder from Chapter 3 
in F ig. 4-1. First, the person- and P"^"^*™ esti mate from the 

ocltes the iris and nose ^J^^^, selecting 9 of the 15 

views. Then ^^"^^Lrtog the two and then correlatmg example 
each to the inpnt by Finally, the recognizer reports the 



off-line procedure 



on-line procedure 



input 
image 



feature 
detection 




4.1 Off-line template extraction

In the off-line preparation of temola. « th* fi . . • 

of each pe rson . As discnssed ^ f P ^ " ^ ™^ 

are taken fa. each person by fixing „Tec e If „ *' a ™ if °™ sel ° f PoL 

desired poses of the 15 views are" „ cale T* ^ 

»W to rotate their face to point ^ ^ **> -™ » 

example views, a set of facial features are m ^ ,°° e d °' S - AftCT taki "« 
bounding hoxes for the tempiat s TZf? 'T" °" ^ '° «^ «« 

nose lobes, and corners of the month I I l " 8 ' mcMe «» "-s, 

denoted P » 0 < , < 5 . The f ™ » °™ ™ 4-2 for an m3 view and are 
set iodudes more features than are returned bv ! ' ^ fe *'» re 

The next step in prepariM r n ? aUt0n,a ' ic feature fi "der. 

pomt s.mdarity transform to place the irises^t a „ u , ^ aPP ' ying a h '° 







of the mouth are manually labeled. 



to the original example image img 

img_sim(p) = img(T(p)).

■ x. +~ facilitate remapping the example image 
Tll e direction of this mapping ,s ^J^XL a grey level value 

img nnder the similarity transform: a p.xel p m 9 „ m 
from .m S (T( P )). The similarity transform has the form 



/ cosfl ■""Md+I'M 



wh ere the scale , image-plane rotation and 2 D — (,,,) are fonnd h, 

solving t(p„) = p?, r(p,)-PT. 

«• - - - -si rr.:™ stsssass 

p^t-'IpD. 2<-£ 5 - 

,„H relative to a coordinate frame denned by a 
The Pi feature locations are measured relative to 

...hole face template" which will be for scale a „ d im age-plaue 

Aft er geometrically normal* the example ,m^ ^ ^ ^ 

rotation, bounding boxes are defW for the eyes ^ ^ ^ ^ ^ 

p,acement and sizes of the bound.ng boxes are det, raeasurem e„ts 
few geometrical n—n en. on the hown, ^ ^ ^ ^ rf 

include the interocular distance d, tne ve 
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VIEWS 




feature 


bounding hr. x 


| left eye 


upper left 
(xo - Ad, 

y 0 - .3d) 


lower right 
(*o + Ad, 
2/o + .3d) 


right eye 


(*i-.4rf.) 
Vi - -3d) 


l*i + Ad, 
Vi + .3d) 


nose 


(*a-.ld, 
(» + «»)/2- .7n> 


(*s + .lrf, 
_Jyi+^)/2-K3n) 


mouth 


(2:4 - .lm, 

_JI^4 J ys)j^3^ 


(is + .lm, j 



of the bounding bolt' J; 3 q v !? 1 °T ^ «» 

*- ™ ^ ~ t::r; rr* 4 " The 

experiments that test Jto rS^X^i ^7 '° 

cessing. The preprocessing filters Llude to , ^" °' W ' <" e ^ 

"agnitude |,W,|, lhe , J d „ /the gradient 

was varied b y changing the distancTh^ ^es"" "T^ ^ 
P. m the transformation 7\ Three different in. T IOCati °" S "° Md 

15, 30, and 60 pixels, with the ktteT h d Were ™ ] "^- 

(«*«,. ToavLpr'oWeLttlT.i XtheTJ d^ ^™<^ • 
■mage was smoothed before downsamolin, Th P ^ t,,e eXam P k 

-ale will be described in section H «0*™sat s with preprocessing and 

section 4.3, the experimental results section. 
[Panel labels: I, d_x I, d_y I, ||∇I||, d_xx I + d_yy I]



Figure 4-4: Feature templates under different types of preprocessing. 



d 




d = 15 d = 30 d = 60 



Figure 4-5: Feature templates under different scales, as determined by d, the interoc- 
ular distance. 



At this point, templates for each example image have been created for a variety of
image preprocessings and scales. As shown in Fig. 4-6(a), let us denote the individual
feature templates as templ_j, 0 ≤ j ≤ 3. These templates form the basis for the
correlation step in the on-line recognition procedure in Fig. 4-1. Additional information,
however, is stored with the templates for the on-line geometrical registration step and
is shown in Fig. 4-6(b). First, to bring the input view into rough correspondence with
the example view, the feature points p_i are stored. To drive the second part of the
geometrical registration, a fine, pixelwise correspondence step, a grey level whole face
template face-templ is stored. The feature locations p_i are defined relative to this
whole face template. The use of the p_i and face-templ will be described in the next
section.
[Figure 4-6 panel labels: templ_0, templ_1; face-templ and p_i]




4.2 On-line recognition algorithm 

"»-.ine" g pro ced„ re ~ g 4- 2 f^TT^l°™* e di *» f " < te 
is given in Fig. 4-7. Pseudocode ? ketch,„g the steps-of our -recognizer 

m atch: The It c: e m t : dat ; b r md returns the ^ 

input a v.ew img input of an unknown person to identify. 

1. Templates, tem^ [person] [view], 0 < j < 3. 

2. Feature locations. p t [person][view], 0 < i < 5. 

3. Whole face template. face-temp{ P erson][vie W ) 
output the closest matching person in the database. 
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Template-based recognizer 



(1) p_0, p_1, p_{2|3} ← feature finder(img_input)
(2) selected views ← left or right group of views, from feature finder
(3) for person = 1 to NUM_PEOPLE              /* for all people in database */
(4)     for view ∈ selected views             /* all views to search */
(5)         geometrical registration
                affine transform: img_aff(p) ← img_input(T(p))
                optical flow: (Δx, Δy) ← optical-flow(face-templ[person][view], img_aff)
                img_warp(x, y) ← img_aff(x + Δx(x, y), y + Δy(x, y))
(6)         correlation
                for j = 0 to 3                /* eyes, nose, mouth */
                    corr_j[person][view] ← normalized-correlation(templ_j[person][view], img_warp)
(7)     score[person] ← sum over j of (max over views of corr_j[person][view])
(8) return arg max score[person]



Figure 4-7: Pseudocode for our template-based recognizer. 
Toserveasarunningexa^ 

• i. tor, The feature finder, which is described m Chapter 6, 

rr^sSTw^ " « <* - ,obes The r tio ; 

input image in Fig. 4-8. Wure finder is used in 

The left vs. right rotation information provided by the feature finder is used in
step (2) to filter the example views. The left vs. right estimate is used to select
the example views that are similar in view to the input, namely either
the left three columns or right three columns of Fig. 1-3:

selected-views(left) = {m10, m9, m8, m5, m4, m3, m15, m14, m13}
selected-views(right) = {m8, m7, m6, m3, m2, m1, m13, m12, m11}.
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Figure 4-8 We demonstrate the recognition algorithm matching the example inout 
to test for each person. 

all " S ' t T <3) (4> rcCOgniZW l0 ° PS ^ r «« " example views of 

all people, matching each in tnrn against the input. To record the temnlate T, 
sco-fortheselectedex^^^^^ 

wmch parallels temp^.] i„ being indexed by featiire ' IJ • 
The mam part of the recognizer steDs fll ^ ln\ V 

brines the inn,,t » n A i • ( P ( geometrical alignment step 

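$$ img_{aff}(p) = img_{input}(T(p)). $$

The affine transform has the form

$$ T(p) = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \end{pmatrix}, $$

where the affine parameters $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$, $t_x$, and $t_y$ are found by solving

$$ T(p_0) = p_0', \quad T(p_1) = p_1', \quad T(p_{2|3}) = p_{2|3}'. $$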
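The three point constraints above determine the six affine parameters exactly. The following is a minimal sketch of that solve, assuming the points are given as (x, y) tuples; the function names `fit_affine` and `apply_affine` are illustrative and not taken from the thesis, and which point set plays the role of source or destination is left to the caller.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(m, rhs):
    """Solve the 3x3 linear system m * x = rhs with Cramer's rule."""
    d = _det3(m)
    if abs(d) < 1e-12:
        raise ValueError("the three source points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(_det3(replaced) / d)
    return solution


def fit_affine(src, dst):
    """Return (a00, a01, a10, a11, tx, ty) such that T(src[k]) = dst[k] for k = 0, 1, 2."""
    m = [[x, y, 1.0] for x, y in src]
    a00, a01, tx = _solve3(m, [p[0] for p in dst])  # x-coordinate equations
    a10, a11, ty = _solve3(m, [p[1] for p in dst])  # y-coordinate equations
    return (a00, a01, a10, a11, tx, ty)


def apply_affine(params, p):
    """Apply T(p) = A p + t to a single (x, y) point."""
    a00, a01, a10, a11, tx, ty = params
    return (a00 * p[0] + a01 * p[1] + tx, a10 * p[0] + a11 * p[1] + ty)
```

In this sketch the three points would be the two irises and the nose point, so a degenerate (collinear) configuration is reported as an error rather than silently producing a singular transform.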

This three point affine transform essentially models the face a, » „1=,„„ k- . 

the 2D aspects of pose: scale, ima g e-p, M e rotation, and 2D translation. "Ins 
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jmg inj)Ui 



fgce4emp/[person ex ][view ex ; 





p- A q. The features detected in the input image (a) are used to affine transform 
in the example view (c). 

xsxssssssstss^, ... - - » - 

1 he second pa 5 between the affine transformed input img af} 

small remaining geometrical differences between tne 

fFi , 4 -9(b)) and the whole face template /a C e4e m p/[ P ers 0 n ex ][v,ew ex J^ig UJ 
^rig,. t »v L V-' farfnrs such as out-of-plane rotation, 

These regaining differences may be due to " dense ^ of 

(Δx, Δy) = optical-flow(face-templ[person_ex][view_ex], img_aff).

Th vector field (Ax Ay) specifies for each pixel (*,») in the whole face template 
The vector fieW Ax y) P correspond mg pixel in the affine trans- 

a relafve offset (M« 4 n driven by the optical flow , 

r;i: — - 1 « «. P — * - 

whole face template 

in*^..») = «■»*//(« + *4*V).V 

their corresponding pixels m the whole face template, ng 
correspondence process for the inpnt and example ,mages of F,g. 4-8. 
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are now in pixelvvise correspondence. '«>",xl[v.ew ex J and 

When the input and example are the same person, optical flow generally succeeds
in finding correspondence and can compensate for rotation, scale, and expression
differences between the affine transformed input and example image. When the
input and example are from different people, optical flow can fail to find correspondence

reject the ^ ^ "°" *« « *> 

van Doom [79] and Shashua [119] [120]). ? ( Koendennk and 

Now that the input and example images have been geometrically registered in step 
each example template is correlated over a small region (e.g. 5x5) centered around its
expected location in the warped input. Normalized correlation is the matching metric

$$ r = \frac{\langle TI \rangle - \langle T \rangle \langle I \rangle}{\sigma(T)\,\sigma(I)}, $$

where T is the template, I is the subportion of image being matched against, <>
is the mean operator, and σ() measures standard deviation. As mentioned in the
introduction, normalized correlation provides an invariance to grey level shifts in
the template T of the form aT + b, where a is a multiplicative constant and b is
an additive constant. This kind of invariance may provide immunity to differences
between template and image in the overall ambient lighting level or camera contrast.
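To make the invariance claim concrete, here is a small sketch of normalized correlation on flattened grey level patches. It assumes both patches are plain Python lists of equal length; the function name and the handling of constant patches are illustrative choices, not part of the thesis.

```python
import math


def normalized_correlation(template, patch):
    """(<TI> - <T><I>) / (sigma(T) * sigma(I)) for two equally sized flat patches."""
    n = len(template)
    if n == 0 or n != len(patch):
        raise ValueError("template and patch must be non-empty and equally sized")
    mean_t = sum(template) / n
    mean_i = sum(patch) / n
    # Covariance equals <TI> - <T><I>; computed from deviations for numerical stability.
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(template, patch)) / n
    var_t = sum((t - mean_t) ** 2 for t in template) / n
    var_i = sum((i - mean_i) ** 2 for i in patch) / n
    if var_t == 0.0 or var_i == 0.0:
        return 0.0  # constant patch: correlation is undefined, treated as "no match" here
    return cov / math.sqrt(var_t * var_i)
```

For example, scaling a patch by a positive gain and adding an offset, as a change in camera contrast or ambient lighting might, leaves the score at 1.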

When scoring a person in step (7), the system takes the sum of correlations from
the best matching eye, nose, and mouth templates. Note that we maximize over the
views separately for each template, so the best matching left eye could be from a
different view than the best matching nose. Swapping
the sum and max operations - first summing template scores and then maximizing
over views - gives slightly worse performance, probably because the original sum/max
ordering is more flexible. Step
(8) returns the person with the highest correlation score - we have not yet developed
a criterion on how good a match has to be to be believable. A first step in studying
this problem could be to compare the correlation score statistics for correct matches
against those for incorrect matches. Considering a task like face verification, having
the ability to reject inputs is important and is something we plan under future work.
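The difference between the two orderings of the sum and max operations can be sketched as follows. The data layout (a dictionary from view name to a list of per-template correlation scores) and the function names are assumptions made for illustration.

```python
def score_sum_of_max(corr):
    """Maximize over views separately for each template, then sum over templates."""
    views = list(corr.values())
    num_templates = len(views[0])
    return sum(max(scores[j] for scores in views) for j in range(num_templates))


def score_max_of_sum(corr):
    """Alternative ordering: sum the template scores per view, then maximize over views."""
    return max(sum(scores) for scores in corr.values())


# Hypothetical correlation scores for one person: [left eye, right eye, nose, mouth].
example_corr = {
    "m3": [0.9, 0.6, 0.7, 0.8],
    "m8": [0.7, 0.9, 0.8, 0.6],
}
```

Because each template may pick its own best view, the first score is never smaller than the second; in the hypothetical example the two views each win on two templates, so the per-template maximum is noticeably higher.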

4.3 Experimental results 

Our view-based face recognizer was evaluated on a test set of 620 images. Described
in the introduction and Appendix A, the test set contains 10 views each of 62 people.
When taking the test set, subjects in the database were asked to present their face at
a series of random poses to the camera, where the pose is constrained to lie within the
overall range of example view poses. In addition, the subjects were encouraged to rotate
their face in the image plane for half the images, so all three rotation parameters are
varied in the test set. Example sets of test images are shown in Fig. 1-6 and A-2.

On this test set our face recognizer achieves a recognition rate of roughly 98%.
To explore the effects of changing the template scale and preprocessing filter on
this recognition rate, we have performed a series of recognition experiments, where
each experiment runs through the entire test set of 620 images. The recognition
                          performance - 620 test images
preprocessing   correct         2nd place    3rd place   >3rd place   bad features
dx+dy           98.71% (612)    0.32% (2)    0.48% (3)   0.16% (1)    0.32% (2)
mag             98.23% (609)    0.81% (5)    0.32% (2)   0.32% (2)    0.32% (2)
lap             98.07% (608)    0.81% (5)    0.32% (2)   0.48% (3)    0.32% (2)
grey            94.52% (586)    1.94% (12)   0.48% (3)   2.74% (17)   0.32% (2)

Table 4.1: Face recognition performance versus preprocessing. Best performance is
from using the gradient magnitude (mag), Laplacian (lap), or the sum of separate
correlations on the x and y components of the gradient (dx+dy). An intermediate scale
was used, with an interocular distance of 30.
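The counts in Table 4.1 can be checked for internal consistency with a few lines of arithmetic. The sketch below uses the counts printed in the table; the dictionary layout and function names are illustrative only.

```python
TEST_SET_SIZE = 620

# Counts per row: (correct, 2nd place, 3rd place, >3rd place, bad features).
TABLE_4_1 = {
    "dx+dy": (612, 2, 3, 1, 2),
    "mag": (609, 5, 2, 2, 2),
    "lap": (608, 5, 2, 3, 2),
    "grey": (586, 12, 3, 17, 2),
}


def recognition_rate(counts, total=TEST_SET_SIZE):
    """Percentage of test images where the correct person was ranked first."""
    return 100.0 * counts[0] / total


def error_breakdown(counts):
    """Return (recognition errors, feature finder failures) for one table row."""
    return sum(counts[1:4]), counts[4]
```

Every row should account for all 620 test images, and the best row splits its errors into recognition errors and feature finder failures.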



rate in these experiments counts errors in both the feature finder and template-based
recognition. That is, if the feature finder fails, then the template-based recognizer is
not executed and an error is recorded. Our feature finder failed to find the eyes and
nose in 2 test images, so experiments with preprocessing and template scale all
include these 2 errors.

Table 4.1 summarizes our recognition results for the preprocessing experiments.
The types of preprocessing we tested include the gradient magnitude (mag), Laplacian
(lap), sum of separate correlations on x and y components of the gradient (dx+dy),
and the original grey levels (grey). For these experiments we used an
intermediate template scale, an interocular distance of 30. In table 4.1, we list the
number of correct recognitions and the number of times the correct person came in
second, third, or after third place. Best performance was had from dx+dy, mag, and lap,
with dx+dy yielding the best recognition rate at 98.7%. Preprocessing
with the gradient magnitude performs nearly as well, a result in agreement with the
preprocessing experiments of Brunelli and Poggio [28]. Given that using the original
grey levels produces the lower rate of 94.5%, our results indicate that preprocessing
the image with a differential operator gives the system a performance advantage. We
think the performance differences between dx+dy, mag, and lap are too small to say
that one preprocessing type stands out over the others.

Table 4.2 summarizes our recognition results for the template scale experiments,
where scale is measured by the interocular distance of a frontal view. The preprocessing
was fixed at dx+dy. The lower recognition rate at an interocular distance of 15 suggests
that, at least for our input representation, the coarsest scale may be losing detail
needed to distinguish between people. Since the intermediate scale has a computational
advantage over the finer scale, we would recommend operating a face recognizer
at the intermediate scale.

interocular               performance - 620 test images
distance      correct         2nd place    3rd place   >3rd place   bad features
15            96.13% (596)    2.26% (14)   0.32% (2)   0.97% (6)    0.32% (2)
30            98.71% (612)    0.32% (2)    0.48% (3)   0.16% (1)    0.32% (2)
60            98.39% (610)    0.81% (5)    0.16% (1)   0.32% (2)    0.32% (2)

Table 4.2: Face recognition performance versus scale, as measured by interocular
distance (in pixels). The intermediate scale performs the best, a result in agreement
with Brunelli and Poggio [28]. For preprocessing, separate correlations on the x and
y components of the gradient were computed and then summed (dx+dy).

One additional experimental question to ask is how necessary optical flow is to the 
geometrical registration step. To evaluate the impact of optical flow in the recognizer, 
we removed optical flow and tested the recognizer under the different types of prepro- 
cessing. Fig. 4-11 shows the result, where the light bars indicate the original result 
using optical flow, and the dark colored bars without optical flow. Template scale for
this experiment was fixed at an interocular distance of 30 pixels. As evident from the 
bar graph, excluding optical flow results in a drop in the recognition rate of roughly 
3% for the original grey levels to 10% for the Laplacian. This difference between the 
Laplacian and the grey levels may be due to the fact that the Laplacian uses higher 
frequency information, which may make it more susceptible to slight misregistrations. 
Overall, these experiments show that the optical flow step does indeed improve our 
view-based recognizer. 

Getting back to the results from tables 4.1 and 4.2, consider the errors made for 
the best combination of preprocessing and scale: dx+dy at an intermediate scale. Of
the 8 errors, 2 were due to the feature finder and 6 were recognition errors. In the 
one recognition error where the correct person was not even among the top three, 
the correspondences from optical flow were poor. For the other errors, the correct 
person came in either second or third place. For these false positive matches, using 
optical flow to warp the input to the model may be contributing to the problem. If 
two people are similar enough, the optical flow can effectively "morph" one person 
into the other, making the matcher a bit too flexible at times. 

The problem with optical flow sometimes making the matcher too flexible suggests 
some extensions to the recognizer. Since we only want to compensate for rotational, 
scale, or expression changes and not allow "identity-changing" transforms, perhaps 
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Chapter 5 

A vectorized image representation 

S ent faees, a representat.on that proved to be ^ ^ a heav . 

.he recognizer. However, the second half ^ the ^ for genCTatm g 

ier burden on our face represents. Our ^ ordered vector of ,m- 
'^al views use a «uW P These features cau run the 

age measurements taken at a set of fac al fe p ^ ^ rf fte d 
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5.2.2 Optical flow 
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a s eq nence of image frames cont.nmg ™™ g °b * .» d a ^ ^ 
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- — - ■ — 

at time t and t + 1- the computation of optical flow is 
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_ 87 r _ |I j = = 

, , -— fl,.> ' rfl 

If we divide through by δt and make the substitutions u = δx/δt and v = δy/δt,
then we get

$$ u I_x + v I_y + I_t = 0. \qquad (5.2) $$



- - -r, * # strs — sronhe 

solution of optical flow. h used a hiera r- 

While many algorithms exist for e— ^ ^ and Adelson [16] , 

chical, gradient-based algorithm (Lucas and Kanade l8 U], g 
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Bergen and Hingorani [181, Bergen el al [1711 «• r , 

exact solution of equation (5 2) at 'a lot V 1 " " ™ X ^ M 

minimizing the quadratic error ' ^ ""^'^ * is found by 

$$ \min\, (u I_x + v I_y + I_t)^2 $$
squared error term over a region R § ' P S) Md thuS inte S rate *he 

$$ \min \sum_{R} (u I_x + v I_y + I_t)^2. $$
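Minimizing the summed squared error over a region leads to a 2x2 linear system in the flow components u and v. The sketch below solves that system for a single region at a single resolution, assuming the spatial and temporal derivatives have already been sampled into flat lists; it illustrates only the least-squares step, not the hierarchical algorithm used in the thesis, and the function name is hypothetical.

```python
def solve_region_flow(ix, iy, it):
    """Least-squares (u, v) minimizing sum over the region of (u*Ix + v*Iy + It)^2."""
    sxx = sum(a * a for a in ix)
    syy = sum(b * b for b in iy)
    sxy = sum(a * b for a, b in zip(ix, iy))
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    # Normal equations: [sxx sxy; sxy syy] [u; v] = [-sxt; -syt].
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients in the region do not constrain both flow components")
    u = (-sxt * syy + sxy * syt) / det
    v = (-syt * sxx + sxy * sxt) / det
    return u, v
```

The singular case corresponds to a region whose gradients all point along one direction, where only the flow component along the gradient can be recovered.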
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Pressing begins at the coarsest level Re fcL tie ° "V* *" Md 
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scale. As processing moves to hT next fi , ""<*P™de„ces at the reduced 

is used to compensate for the motion Ke, If * ^ lhe pre ™ us 
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residual flow J he compute u i" I ah" ? U " d « Thus, the 

to the flow estimate from'the pZu vt ^ ***** -d added 
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Getting back to the original problem of computing vectorized shape, we can apply 
the optical flow algorithm to the problem of finding intraperson correspondence. One 
image of the person is chosen as a "reference" image, and other images are vectorized 
with respect to it by computing optical flow. To assist the correspondence process, 
the two images are registered to compensate for any differences in 2D translation, 
scale, and image-plane rotation. This registration is performed using a similarity 
transform to align the eyes of the image being vectorized with those of the reference 
image. The eye locations, specified by the center of the irises, can be automatically 
or manually located. 
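A similarity transform (2D translation, scale, and image-plane rotation) is fully determined by two point correspondences, here the two iris centers. A compact way to sketch this is with complex numbers, where the transform is z -> a*z + b; the function names below are illustrative and not from the thesis.

```python
import cmath


def fit_similarity(src0, src1, dst0, dst1):
    """Return complex (a, b) with a*src + b = dst for both point pairs.

    abs(a) is the scale and cmath.phase(a) the image-plane rotation in radians.
    """
    s0, s1 = complex(*src0), complex(*src1)
    d0, d1 = complex(*dst0), complex(*dst1)
    if s0 == s1:
        raise ValueError("the two source points must be distinct")
    a = (d1 - d0) / (s1 - s0)
    b = d0 - a * s0
    return a, b


def apply_similarity(a, b, point):
    """Map a single (x, y) point through z -> a*z + b."""
    z = a * complex(*point) + b
    return (z.real, z.imag)
```

For instance, mapping irises at (100, 120) and (160, 120) onto reference irises at (50, 60) and (80, 60) yields a pure scaling by 0.5 with no rotation.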

A key limitation with this technique, however, is that the optical flow algorithm 
will fail to find correct correspondences when the two faces are dissimilar enough 
in appearance. For example, finding interperson correspondence often breaks down 
when the two people are from different races. Even with intraperson correspondence, 
correspondence will fail if the images are separated by a large enough rotation in 
depth. Thus, when covering an out-of-plane rotation of more than several degrees, 
we use a sequence of images at intermediate poses to assist the correspondence process; 
this is described in more detail in Chapter 7. 



5.2.3 Face vectorizer 

Our face vectorizer, to be described in detail in Chapter 6, is an automatic technique 
for computing both the shape and texture components of our vectorized representa- 
tion. While optical flow is used as a subroutine in the vectorizer, the vectorizer is 
capable of handling the difficult case of interperson correspondence that sometimes 
foils optical flow. What makes the face vectorizer superior to optical flow alone is the 
explicit modeling of the texture component. Texture is modeled using linear combi- 
nations of example textures as in the eigenimage approach to face recognition (Turk 
and Pentland [129], Pentland, et al. [103]). 

Here we briefly sketch the process of vectorizing the shape and texture of an
image $i_a$. Shape is measured by finding pixelwise correspondence $y^{std}_{a-std}$ relative to
a standard face shape. This is done by finding the optical flow between $i_a$ and
a "reference" image at standard face shape, an image that is produced from the
texture model. Standard face shape will be defined as the average of many example
prototype face shapes. Face texture is modeled using the assumption that the space of
face images is linearly spanned by a prototypical set of example face images. That is,
the texture of $i_a$ is modeled by taking a linear combination of a set of n geometrically
normalized example prototype textures $t_{p_j}$, 1 ≤ j ≤ n,

$$ t_a = \sum_{j=1}^{n} \beta_j t_{p_j}. \qquad (5.3) $$

-•asasssfesassssssss 

1: r c — ** — - *■ - •■• — 

where Axf_ rf and Ay- <rf are the * and y components of y- This assists th. 

The vectorization procedure iteratively solves for a flow $(\Delta x^{std}_{a-std}, \Delta y^{std}_{a-std})$ and coefficients
$\beta_j$ that solve

$$ i_a(x + \Delta x^{std}_{a-std}(x, y),\; y + \Delta y^{std}_{a-std}(x, y)) = \sum_{j=1}^{n} \beta_j t_{p_j}. $$

both C rZ n boTh Ce betWee " tW ° arbi ' rary imageS CM ' hus b * fo ™<i * vectoring 
Figure 5-4: A forward warp moves pixels from $i_b$ to $i_a$ using the shape $y^b_{a-b}$. A
backward warp is the inverse, moving pixels from $i_a$ to $i_b$.

5.3.1 Warping operators 

2D warping operations locally distort the arrangement of pixel intensities in an image
by following the correspondence vectors in a relative shape vector $y^b_{a-b}$. Using
nomenclature from the computer graphics community, we define two types of warping
operations, backward and forward warping (Fig. 5-4). The difference between the two
lies in the direction that pixel values travel between shapes $y_a$ and $y_b$. A backward
warp "retrieves" pixels from an image at shape $y_a$ and places them at their corresponding
locations in reference shape $y_b$. Inversely, a forward warp "pushes" pixels
in an image at shape $y_b$ forward along the correspondence vectors to the shape $y_a$.

Backward warp 

In an example backward warp, let $i_a$ be an arbitrary image at shape $y_a$ that we wish
to warp to the shape $y_b$, producing $i_b$. Since the correspondences $y^b_{a-b}$ are defined
relative to $y_b$, pixels in $i_b$ can simply use the correspondence field to "index" into
image $i_a$

$$i_b(q_b) = i_a(q_b + y^b_{a-b}(q_b)), \qquad (5.4)$$

where $q_b$ is a 2D pixel location in $y_b$ and $q_a = q_b + y^b_{a-b}(q_b)$ is the corresponding
point in $y_a$. Since in general $q_a$ will not be at an integral pixel location in $y_a$, bilinear
interpolation is used to sample an intensity value from $i_a$. As shown in Fig. 5-5, we
interpolate $i_a$ at the set of four pixels $\{p_{a,i}\}_{i=0}^{3}$ neighboring $q_a$

$$i_b(q_b) = \sum_{i=0}^{3} w_i(\alpha, \beta)\, i_a(p_{a,i}),$$

where

$$(\alpha, \beta) = q_a - p_{a,0}$$

and

$$w_0(\alpha, \beta) = (1 - \alpha)(1 - \beta), \qquad w_1(\alpha, \beta) = \alpha (1 - \beta), \qquad (5.5)$$

with the remaining two weights, $(1 - \alpha)\beta$ and $\alpha\beta$, belonging to the other two neighbors.

Figure 5-5: Interpolation for backward and forward warping. In a backward warp,
point $q_b$ retrieves a value from $q_a$. In a forward warp, first the correspondences $y^b_{a-b}$
are inverted to produce $y^a_{b-a}$, as depicted by the arrow from $q_a$ to $q_b$. Then point $q_a$
retrieves a value from $q_b$ as in a backward warp. See text for more details.

When referring to backward warping in future chapters, we will either use the
mathematical form in equation (5.4) or the notation $i_b = \text{bwarp}(i_a, y^b_{a-b})$.
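The backward warp of equation (5.4) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work. It assumes images and flow components are stored as lists of rows, that the neighbors are ordered $p_{a,0}$ (origin), $p_{a,1}$ (one step in x), then the two pixels in the next row, and that sample points outside the image are clamped to the border. The names `bilinear_sample` and `bwarp` are chosen here for illustration.

```python
def bilinear_sample(img, x, y):
    """Sample grey-level image `img` (list of rows) at a non-integral (x, y)."""
    h, w = len(img), len(img[0])
    # Assumption: clamp to the image so four neighbours always exist (h, w >= 2).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0 = min(int(x), w - 2)          # column of p_{a,0}
    y0 = min(int(y), h - 2)          # row of p_{a,0}
    a, b = x - x0, y - y0            # (alpha, beta) = q_a - p_{a,0}
    return ((1 - a) * (1 - b) * img[y0][x0]
            + a * (1 - b) * img[y0][x0 + 1]
            + (1 - a) * b * img[y0 + 1][x0]
            + a * b * img[y0 + 1][x0 + 1])


def bwarp(img_a, flow_x, flow_y):
    """Backward warp: i_b(q_b) = i_a(q_b + y(q_b)), with y given per pixel of i_b."""
    h, w = len(flow_x), len(flow_x[0])
    return [[bilinear_sample(img_a, x + flow_x[y][x], y + flow_y[y][x])
             for x in range(w)]
            for y in range(h)]
```

With a zero flow field the output equals the input; a constant flow of half a pixel in x averages horizontally adjacent pixels.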

Forward warp 

Given an arbitrary image $i_b$ at shape $y_b$, a forward warp sends a pixel value $i_b(q_b)$
to the point $q_a$, its corresponding location in shape $y_a$, where $q_a = q_b + y^b_{a-b}(q_b)$.
Since the point $q_a$ may not be at an integral location, it is not immediately clear
how to store the grey level value in the warped version of $i_b$. We approach this
problem by first inverting the shape $y^b_{a-b}$ to produce $y^a_{b-a}$. By reversing the directions
of the correspondence field, the forward warping thus becomes a backward warping,
so the previously described backward warping algorithm can be applied. The key
part is inverting the correspondences. Referring to Fig. 5-5, our goal is to construct
the correspondence from a pixel $q_a$ to $q_b$ given that the correspondences are defined
in the opposite direction.

Inverting the correspondences is solved using the idea of four corner mapping.
We repeat the following steps for every square source patch of
four adjacent pixels $\{p_{b,i}\}_{i=0}^{3}$ in $y_b$. Map the source patch to a quadrilateral $\{p_{a,i}\}_{i=0}^{3}$
in the shape $y_a$

$$p_{a,i} = p_{b,i} + y^b_{a-b}(p_{b,i}), \quad 0 \le i \le 3.$$

For each pixel $q_a$ inside this quadrilateral, we estimate its position inside the
quadrilateral, treating the sides of the quadrilateral as a warped coordinate system. The
parametric position $(\alpha, \beta)$ within the quadrilateral is estimated by solving the 2x2
nonlinear system for bilinear interpolation

$$q_a = \sum_{i=0}^{3} w_i(\alpha, \beta)\, p_{a,i}$$

using a Newton-Raphson method (see section 9.6 of [112]), where the $w_i$ are as defined
in equation (5.5). Note that the $(\alpha, \beta)$ position lies within the unit square. This
position is then used to map to a location $q_b$ in the original source patch

$$q_b = p_{b,0} + (\alpha, \beta)$$

and the correspondence field $y^a_{b-a}$ has been determined at a point

$$y^a_{b-a}(q_a) = q_b - q_a.$$

Processing now proceeds as with a backward warping.

The notation fwarp will be used in Chapter 7 to denote the forward warping
procedure. Forward warping image $i_b$ to $i_a$ will be written as $i_a = \text{fwarp}(i_b, y^b_{a-b})$.
In general, we can push pixels along any arbitrary flow x, yielding the more general
form $\text{fwarp}(i_b, y^b_x)$. The only restriction is that the subscript of the image
argument must match the superscript of the shape argument, implying that the image
must be in the reference frame of the shape.
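The inversion step inside one quadrilateral can be sketched as follows. This is an illustrative Newton-Raphson solver, not the routine of [112]. It assumes the corner ordering $p_0$ (origin), $p_1$ (one step in x), $p_2$ (one step in y), $p_3$ (diagonal), so that the weights are $(1-\alpha)(1-\beta)$, $\alpha(1-\beta)$, $(1-\alpha)\beta$, $\alpha\beta$. The function names are hypothetical.

```python
def bilinear_point(p, a, b):
    """Forward map: position inside quadrilateral p for parameters (a, b)."""
    w = ((1 - a) * (1 - b), a * (1 - b), (1 - a) * b, a * b)
    return (sum(wi * pi[0] for wi, pi in zip(w, p)),
            sum(wi * pi[1] for wi, pi in zip(w, p)))


def invert_bilinear(q, p, iters=20, tol=1e-10):
    """Solve q = sum_i w_i(a, b) p_i for (a, b) by Newton-Raphson."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = p
    ex, ey = x1 - x0, y1 - y0                      # edge along alpha
    fx, fy = x2 - x0, y2 - y0                      # edge along beta
    cx, cy = x0 - x1 - x2 + x3, y0 - y1 - y2 + y3  # bilinear cross term
    a = b = 0.5                                    # start at the patch centre
    for _ in range(iters):
        rx = x0 + a * ex + b * fx + a * b * cx - q[0]
        ry = y0 + a * ey + b * fy + a * b * cy - q[1]
        if abs(rx) + abs(ry) < tol:
            break
        j11, j12 = ex + b * cx, fx + a * cx        # Jacobian of the residual
        j21, j22 = ey + b * cy, fy + a * cy
        det = j11 * j22 - j12 * j21
        if det == 0.0:
            raise ValueError("degenerate quadrilateral")
        a -= (rx * j22 - j12 * ry) / det           # Cramer's rule for J d = r
        b -= (j11 * ry - j21 * rx) / det
    return a, b


def inverse_flow_at(q_a, p_a, p_b0):
    """Inverse correspondence y(q_a) = q_b - q_a, with q_b = p_b0 + (a, b)."""
    a, b = invert_bilinear(q_a, p_a)
    return (p_b0[0] + a - q_a[0], p_b0[1] + b - q_a[1])
```

For an undistorted unit-square patch the system is linear and the solver converges in a single step.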

5.3.2 Shape manipulation operators 

In this section, we introduce shape manipulation operators that generate new shapes 
from shape arguments. 

First, shapes can be combined using binary operations such as addition and
subtraction. In adding and subtracting shapes, the reference frames of both shapes must
be the same, and the subscripts of the shape arguments are added/subtracted to yield
the subscripts of the results: $y^a_{x \pm z} = y^a_x \pm y^a_z$.

The reference frame of a shape $y^b_x$ can be changed from $i_b$ to $i_a$ by applying a
forward warp with the shape $y^b_{a-b}$. Shown pictorially in Fig. 5-6(a), the operation
consists of separate 2D forward warps on the x and y components of $y^b_x$, interpreted
for the moment as images instead of vectors. Instead of pushing grey level pixels in
the forward warp, we push the x and y components of the shape. The operation in
Fig. 5-6(a) is denoted $y^a_x = \text{fwarp-vect}(y^b_x, y^b_{a-b})$. The inverse operation, shown in
Fig. 5-6(b), is computed using two backward warps instead of forward ones:
$y^b_x = \text{bwarp-vect}(y^a_x, y^b_{a-b})$.

Figure 5-6: (a) To change the reference frame of flow $y^b_x$ from $i_b$ to $i_a$, the x and y
components are forward warped along $y^b_{a-b}$, producing the dotted flow $y^a_x$. In (b),
backward warping is used to compute the inverse.

Figure 5-7: In flow concatenation, the flows $y^a_{b-a}$ and $y^b_{c-b}$ are composed to produce
the dotted flow $y^a_{c-a}$.

Finally, two flow fields $y^a_{b-a}$ and $y^b_{c-b}$ can be concatenated or composed to produce
pixelwise correspondences between $i_a$ and $i_c$, $y^a_{c-a}$. Concatenation is shown
pictorially in Fig. 5-7 and is denoted $y^a_{c-a} = \text{concat}(y^a_{b-a}, y^b_{c-b})$. The basic idea behind
implementing this operator is to put both shapes in the same reference frame and
then add. This is done by first computing $y^a_{c-b} = \text{bwarp-vect}(y^b_{c-b}, y^a_{b-a})$, followed
by

$$y^a_{c-a} = y^a_{b-a} + y^a_{c-b}.$$
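Flow concatenation can be sketched as one backward warp of each flow component followed by an addition. The code below is an illustrative sketch under stated assumptions: a flow is a pair of lists of rows `(dx, dy)`, sampling uses bilinear interpolation with border clamping, and the names `sample` and `concat` are chosen here for illustration.

```python
def sample(field, x, y):
    """Bilinear sample of a scalar field (list of rows), clamped to the border."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    a, b = x - x0, y - y0
    return ((1 - a) * (1 - b) * field[y0][x0] + a * (1 - b) * field[y0][x0 + 1]
            + (1 - a) * b * field[y0 + 1][x0] + a * b * field[y0 + 1][x0 + 1])


def concat(flow_ba, flow_cb):
    """Compose a->b (frame a) with b->c (frame b) into a->c (frame a)."""
    ba_x, ba_y = flow_ba
    cb_x, cb_y = flow_cb
    h, w = len(ba_x), len(ba_x[0])
    out_x = [[0.0] * w for _ in range(h)]
    out_y = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Where does pixel (x, y) of frame a land in frame b?
            xb, yb = x + ba_x[y][x], y + ba_y[y][x]
            # bwarp-vect: bring the b->c flow into frame a, then add.
            out_x[y][x] = ba_x[y][x] + sample(cb_x, xb, yb)
            out_y[y][x] = ba_y[y][x] + sample(cb_y, xb, yb)
    return out_x, out_y
```

Composing two constant flows simply adds their displacement vectors, which is a quick sanity check for the operator.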

5.4 Summary 

In this chapter, we first introduced a vectorized image representation, a feature-based 
representation where correspondence has been established with respect to a reference 
image. Two image measurements are made at the feature points. First, feature
geometry, or shape, is represented by the feature locations relative to some
standard face shape. Second, grey levels, or texture, are represented by mapping
image grey levels onto the standard face shape.

Next, we discussed three methods for computing this vectorized representation
for face images. The problem is basically one of finding feature correspondence with
respect to the standard reference shape. The first technique was the sparse data
interpolation method of Beier and Neely [13], which relies on a set of manually placed
features. Second, optical flow can automatically find correspondences when the two
face images are not too far separated in pose, lighting, expression, etc. Finally, our
face vectorizer is a novel automatic approach that outperforms optical flow by virtue
of incorporating an appearance-based model for face grey levels. The vectorizer is 
described in more detail in the next chapter. 

Finally, the chapter closed with a discussion of backwards and forwards warping 
operators as well as miscellaneous shape operators such as addition and concatenation. 
These operators will be useful in Chapter 7 on synthesizing virtual views.

Chapter 6

Vectorizing face images 



The previous chapter defined a vectorized representation to be a feature-based 
representation where correspondence has been established relative to a fixed reference 
object or reference image. In this chapter, we introduce an algorithm for comput- 
ing the vectorized representation for faces. Computing the vectorized representation 
can be thought of as arranging the feature sets into ordered vectors so that the ith 
element of each vector refers to the same feature point for all objects. Given the cor- 
respondences in the vectorized representation, applications such as feature detection 
and pose and expression estimation are possible. 

As mentioned in the previous chapter, the two primary components of the vec- 
torized representation are shape and texture. Previous approaches in analyzing faces 
have stressed either one component or the other, such as feature localization or decom- 
posing texture as a linear combination of eigenfaces (see Turk and Pentland [129]). 
The key aspect of our vectorization algorithm, or "vectorizer", is that the processes
for the analysis of shape and texture are coupled: each process uses the output of
the other.
The texture analysis uses shape for geometrical normalization, and shape analysis 
uses texture to synthesize a reference image for feature correspondence. Empirically, 
we have found that this links the two processes in a positive feedback loop. Iterating 
between the shape and texture steps causes the vectorized representation to converge 
after several iterations. 

Our vectorizer is similar to the active shape model of Cootes, et al. [44] [43] in 
that both iteratively fit a shape/texture model to the input. But there are interesting 
differences in the modeling of both shape and texture. In our vectorizer there is 
no model for shape; it is measured in a data-driven manner using optical flow. In 
active shape models, shape is modeled using a parametric, example-based method. 
First, an ensemble of shapes is processed using principal component analysis, which
produces a set of "eigenshapes". New shapes are then written as linear combinations
of these eigenshapes. Texture modeling in their approach, however, is weaker than in
ours. Texture is only modeled locally along 1D contours at each of the feature points
defining shape. Our approach models texture over larger regions, such as eye, nose,
and mouth templates, which should provide more constraint for textural analysis.
In the future we intend to add a model for shape similar to active shape models, as
discussed ahead in section 6.6.2.

Figure 6-1: To define the shape of the prototypes off-line, manual line segment features
are used. After Beier and Neely [13].

In this chapter, we first explore the description of the vectorized representation
in more detail, focusing on the definition of standard shape. Then the basic coupling
of the shape and texture computations is motivated, followed by a description of the
vectorization algorithm. A hierarchical coarse-to-fine implementation is described, an
application to feature detection is presented, and we close with a discussion of future
work.

6.1 Standard shape 

Since the vectorized representation is relative to a 2D reference, first we define a 
standard feature geometry for the reference image. The features on new faces will 
then be measured relative to the standard geometry. In this chapter, the standard 
geometry for frontal views of faces is defined by averaging a set of line segment features 
over an ensemble of "prototype" faces. Fig. 6-1 shows the line segment features for a 
particular individual, and Fig. 6-2 shows the average over a set of 14 prototype people.
Features are assigned a text label (e.g. V ) so that corresponding line segments can
be paired across images. As we will explain later in section 6.3.1, the line segment
features are specified manually in an initial off-line step that defines the standard
feature geometry.

Figure 6-2: Manually defined shapes are averaged to compute the standard face shape.

The two components of the vectorized representation, shape and texture, can now
be defined relative to this standard shape.
6.1.1 Shape 

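The shape vector of an image $i_a$, denoted $y^{std}_{a-std}$, could be sparsely defined, perhaps
recording (x, y) locations of a sparse set of features such as the line segments of
Fig. 6-2. However, to support the run-time vectorization procedure, shape is spatially
oversampled: we use a pixelwise representation for shape, defining a feature point at
each pixel in a subimage containing the face. The shape vector $y^{std}_{a-std}$ can then be
visualized as a vector field of correspondences between a face at standard shape and
the given image $i_a$ being represented. If there are n pixels in the face subimage being
vectorized, then the shape vector consists of 2n values, an (x, y) displacement for each
pixel. Fig. 6-3 shows the shape representation $y^{std}_{a-std}$ for the image $i_a$; as indicated by
the grey arrow, correspondences are measured relative to the reference image $i_{std}$ at
standard shape. This shape is what drives the 2D warping operator for geometrical
normalization.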
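6.1.2 Texture

Figure 6-3: Our vectorized representation for image $i_a$ with respect to the reference
image $i_{std}$ at standard shape. Pixelwise correspondence is computed between
$i_{std}$ and $i_a$, as indicated by the grey arrow. Shape $y^{std}_{a-std}$ specifies
a corresponding pixel in $i_a$ for each pixel in $i_{std}$. Texture $t_a$ consists of the grey levels
of $i_a$ mapped onto the standard shape.

The texture vector $t_a$ is the input image geometrically normalized to the standard
shape (see equation (6.8)). In Fig. 6-3, the lower left shows the texture vector $t_a$ for
the input image $i_a$ in the upper right.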



6.1.3 Separation of shape and texture 

Ideally, the ultimate shape description would be a 3D one
where the (x, y, z) coordinates are represented, and texture would be a description of
local surface albedo at each feature point on the object. Such descriptions are common
for the modeling of 3D objects in computer graphics, and it would be nice for
vision algorithms to invert the imaging or rendering process from 3D models to 2D
images.

What our 2D vectorized description has done, however, is to factor out and explicitly
represent the salient aspects of 2D shape. The spatial density of this
2D representation depends, of course, on the density of features defining standard
shape, shown in our case in Fig. 6-2. Some aspects of 2D shape, such as lip or eye
thickness, are not captured by this feature set. However, one
could extend the standard feature set to include more features, such as the
eyebrows, if desired. For texture, there are non-albedo factors confounded in the texture
component, such as lighting conditions and the z-component of shape. Overall,
though, remember that only one view of the object is being analyzed.

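6.2 Shape/texture coupling

One of the main results of this chapter is that the computations for the shape and
texture components can be algorithmically coupled. That is, shape can be used to
geometrically normalize the input image prior to texture analysis. Likewise, the
result of texture analysis can be used to synthesize a reference image for finding
shape correspondences. Let us now explore the coupling of shape and texture in
more detail.

6.2.1 Shape perspective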

Since the vectorized representation is determined by an ordered set of feature points,
computing the representation is essentially a feature finding or correspondence task.
Consider this correspondence task under a special set of circumstances: we know who
the person is, and we have prior example views of that person. In this case, a simple
correspondence algorithm should suffice: shape can be found by running the
algorithm, for example optical flow, between the image and an example view of the
person being vectorized.

If we have no prior knowledge of the person being vectorized, the correspondence
problem is harder, since no example view of that particular person is available to
serve as a reference [...].

Going one step further, we use the assumption, commonly made in eigenimage-based
recognition and detection (Turk and Pentland [129]), that the space of grey level
images of faces is linearly spanned by a set of example views. That is, the geometrically
normalized texture vector $t_a$ of the input is approximated as a
linear combination of n prototype textures $t_{p_j}$



$$\hat{t}_a = \sum_{j=1}^{n} \beta_j t_{p_j}. \qquad (6.1)$$



The approximation $\hat{t}_a$ is generated by taking a linear combination of the prototype
textures as in equation (6.1). If the vectorization procedure can produce such an
approximation, then computing correspondences should be simple. Since the computed
"reference" image $\hat{t}_a$ approximates the texture $t_a$ of the input and is geometrically
normalized, we are back to the situation where a simple correspondence algorithm
can be used.

6.2.2 Texture perspective

To develop the vectorization technique from the texture perspective, consider the
simple eigenimage, or "eigenface", model for the space of grey level face images. The
eigenface approach for modeling face images has been used recently for a variety of
facial analysis tasks, including face recognition (Turk and Pentland [129], Akamatsu,
et al. [5], Pentland, et al. [103]), face detection (Sung and Poggio [125], Moghaddam
and Pentland [96]), and facial feature detection (Pentland, et al. [103]). The main
assumption behind this modeling approach is that the space of grey level images of
faces is linearly spanned by a set of example face images. To optimally represent this
"face space", principal component analysis is applied to the example set, extracting
an orthogonal set of eigenimages that define the dimensions of face space. Arbitrary
faces are then represented by the set of coefficients computed by projecting the face
onto the set of eigenimages.

One requirement on face images, both for the example set fed to principal components
and for new images projected onto face space, is that they be geometrically
normalized so that facial features line up across all images. Most normalization
methods use a global transform, usually a similarity or affine transform, to align two or
three major facial features. For example, in Pentland, et al. [103], the imaging
apparatus effectively registers the eyes, and Akamatsu, et al. [5] register the eyes and mouth.

However, because of the inherent variability of facial geometries across different
people, aligning just a couple of features leaves the remaining features
misaligned. To the extent that features are misaligned, even this normalized
representation will confound differences in grey level information with differences in
local facial geometry.

To decouple texture and shape, Craw and Cameron [45] and Shackleton and
Welsh [118] represent shape separately and use it to geometrically normalize face
texture by deforming it to a standard shape. Shape is defined by the locations
of a set of feature points, as in our definition for shape. In Craw and Cameron [45],
76 points outlining the eyes, nose, mouth, eyebrows, and head are used. To
geometrically normalize texture using shape, image texture is deformed to a standard face
shape, making it "shape free". This is done by first triangulating the image using the
features and then texture mapping.

However, these approaches do not provide an effective automatic method for
finding correspondences for shape, where probably on the order of tens of features
need to be located. Craw and Cameron manually locate their features. Shackleton
and Welsh, who focus on eye images, use the deformable template approach of
Yuille, Cohen, and Hallinan [144] to locate eye features. For some of their example
eye images, however, feature localization is rated as either "poor" or "no fit".

Note that in both of these approaches, computation of the shape and texture
components has been separated, with shape being computed first. This differs from
our approach, where shape and texture computations are interleaved in an iterative
fashion. In their approach the link from shape to texture is present: shape is used
to geometrically normalize the input. But using a texture model to assist in finding
correspondences is not exploited.

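6.2.3 Combining shape and texture

Our face vectorizer consists of two primary steps, a shape step that computes the
vectorized shape $y^{std}_{a-std}$ and a texture step that uses the texture model to approximate
the texture vector $t_a$. Key to our vectorization procedure is linking the two steps in
a mutually beneficial manner and iterating back and forth between the two until the
representation converges. First, consider how the result of the texture step can be
used to assist the shape step. Assuming for the moment that the texture step can
provide an estimate $\hat{t}_a$ using equation (6.1), the shape step estimates $y^{std}_{a-std}$ by
computing optical flow between the input $i_a$ and $\hat{t}_a$.

Next, to complete the loop between shape and texture, consider how the shape
$y^{std}_{a-std}$ can be used to compute the texture approximation $\hat{t}_a$. The shape is used to
geometrically normalize the input image using the backward warp

$$t_a(x) = i_a(x + y^{std}_{a-std}(x)),$$

where x = (x, y) is a 2D pixel location in standard shape. This normalization step
aligns the facial features in the input image with those in the textures $t_{p_j}$. Thus,
when $t_a$ is approximated in the texture step by projecting it onto the linear space
spanned by the $t_{p_j}$, facial features are properly registered.

Given initial conditions for shape and texture, the coupled system switches back
and forth between the texture and shape computations until a stable solution is found.

6.3 Basic Vectorization Method

The basic vectorization method breaks down into two parts: the off-line
preparation of the example textures $t_{p_j}$, and the run-time vectorization procedure
applied to a new input image.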

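6.3.1 Off-line preparation of examples

The basic assumption made in modeling vectorized texture is that the space of face
textures is linearly spanned by a set of geometrically normalized example face
textures. Thus, in constructing a vectorizer we must first collect a group of representative
faces that will define face space. Before the example faces are used in the vectorizer,
they are geometrically normalized to align facial features, and their grey levels are
processed using principal components or the pseudoinverse to simplify run-time
texture processing.

Geometric normalization

To geometrically normalize an example face, we apply a local deformation to the
image to warp the face shape into the standard geometry. This local deformation
requires both the shape of the example face as well as a definition of the standard
shape. Thus, our off-line normalization procedure needs the face shape component for
our example faces, something we provide manually. These manual correspondences
are averaged to define the standard shape. Finally, a 2D warping operation is applied
to do the normalization. We now go over these steps in more detail.

First, to define the shape of the example faces, a set of line segment features
are positioned manually for each. The features follow the manual correspondence
technique of Beier and Neely [13] for morphing face images.

Next, we average the line segments over the example images to define the standard
face shape (see Fig. 6-2). We do not have to use averaging: since we are creating a
definition, we could have just chosen a particular example face. However, averaging
shape should keep down the total amount of distortion required in the next step of
geometrical normalization.

Figure 6-4: Examples of off-line geometrical normalization of example images. [...]
that is why a chin is generated for the second example.

Finally, images are geometrically normalized using the local deformation technique
of Beier and Neely [13]. This deformation technique is driven by the pairing of line
segments in the example image with line segments in the standard shape. Consider a
single pairing of line segments, one segment $l_{ex}$ from the example image and one
segment $l_{std}$ from the standard shape. This line segment pair sets up a local transform
from the region surrounding $l_{ex}$ to the region surrounding $l_{std}$. The local transform
resembles a similarity transform except that there is no scaling perpendicular to the
segment, just scaling along it. The local transforms are computed for each segment
pair, and the overall warping is taken as a weighted average. Some examples of images
before and after normalization are shown in Fig. 6-4.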

Texture processing 

Our texture model assumes that the input texture $t_a$ can be approximated as a linear
combination of the example textures $t_{p_j}$. Of course, given
a linear subspace such as our face space, one can choose among different sets of basis
vectors that will span the same subspace. One popular method for choosing the basis set,
the eigenimage approach, applies principal components analysis to the example
set. Another potential basis set is simply the original set of images themselves. We
now describe the off-line texture processing for these two basis sets, principal
components and the original images.

Principal components analysis is a classical technique for reducing the dimensionality
of a cluster of data points, where the data are assumed to be distributed in an
ellipsoid pattern about a cluster center. If there is correlation in the data among the
coordinate axes, then one can project the data points to a lower dimensional subspace
without losing information. This corresponds to an ellipsoid with interesting
variation along a number of directions that is less than the dimensionality of the data
points. Principal components analysis finds the lower dimensional subspace inherent
in the data points. It works by finding a set of directions $e_i$ such that the variance in
the data points is highest when projected onto those directions. These $e_i$ directions
are computed by finding the eigenvectors of the covariance matrix of the data
points.

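For our ellipsoid of n geometrically normalized textures $t_{p_j}$, let $t'_{p_j}$ be the set of
textures with the mean $t_{mean}$ subtracted off

$$t_{mean} = \frac{1}{n} \sum_{j=1}^{n} t_{p_j}$$

$$t'_{p_j} = t_{p_j} - t_{mean}, \quad 1 \le j \le n.$$

If we let T be a matrix where the jth column is $t'_{p_j}$

$$T = [\, t'_{p_1} \; t'_{p_2} \; \cdots \; t'_{p_n} \,],$$

then the covariance matrix is defined as

$$\Sigma = T T^t.$$

Notice that $\Sigma$ is an $m \times m$ matrix, where m is the number of pixels in the texture
vectors. Due to our pixelwise representation, m is much larger than n, so direct
eigenanalysis of $\Sigma$ can be expensive. Fortunately, one can instead solve the smaller
$n \times n$ eigenvector problem for $T^t T$. This is
possible because an eigenvector $e_i$ of $T^t T$

$$T^t T e_i = \lambda_i e_i$$

corresponds to an eigenvector $T e_i$ of $\Sigma$. This can be seen by multiplying both sides
of the above equation by matrix T

$$(T T^t)\, T e_i = \lambda_i T e_i.$$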
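The small-matrix trick can be checked numerically with a short sketch. The code below is illustrative only: it finds just the leading eigenvector of $T^t T$ by power iteration (an assumption made to keep the example dependency-free; a full eigen-solver would normally be used), lifts it to pixel space as $T e$, and normalizes it. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def centred_columns(textures):
    """Return (mean, cols); cols[j] is texture j minus the mean texture."""
    n, m = len(textures), len(textures[0])
    mean = [sum(t[k] for t in textures) / n for k in range(m)]
    cols = [[t[k] - mean[k] for k in range(m)] for t in textures]
    return mean, cols


def top_eigenimage(textures, iters=200):
    """Leading eigenimage via the n x n matrix T^t T instead of the m x m T T^t."""
    mean, cols = centred_columns(textures)
    n, m = len(cols), len(mean)
    gram = [[dot(ci, cj) for cj in cols] for ci in cols]     # T^t T
    e = [(-1.0) ** j * (j + 1) for j in range(n)]            # arbitrary start vector
    for _ in range(iters):                                   # power iteration
        e = [dot(row, e) for row in gram]
        norm = math.sqrt(dot(e, e))
        e = [v / norm for v in e]
    lam = dot(e, [dot(row, e) for row in gram])              # Rayleigh quotient
    u = [sum(e[j] * cols[j][k] for j in range(n)) for k in range(m)]  # u = T e
    norm = math.sqrt(dot(u, u))
    return mean, lam, [v / norm for v in u]
```

The returned vector satisfies $\Sigma u = \lambda u$ up to rounding, which is the identity derived above.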

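Since the eigenvectors (or eigenimages) with the larger eigenvalues $\lambda_i$ explain the
most variance in the example set, only a fraction of the eigenimages need to be
retained for the basis set.

[Figure: the mean image $t_{mean}$ and leading eigenimages (panels include $e_3$, $e_4$, $e_5$) from
principal components analysis applied to a group of example textures.]

Since the eigenimages are orthogonal, run-time texture processing
can be easily performed. Say that we retain N eigenimages, and let $t_a$ be a
geometrically normalized texture to analyze. The run-time vectorization procedure
projects $t_a$ onto the $e_i$

$$\beta_i = e_i \cdot (t_a - t_{mean}) \qquad (6.2)$$

and can reconstruct $t_a$, yielding $\hat{t}_a$

$$\hat{t}_a = t_{mean} + \sum_{i=1}^{N} \beta_i e_i. \qquad (6.3)$$

Alternatively, the original example textures can themselves serve as the basis set,
approximating $t_a$ by

$$\hat{t}_a = \sum_{j=1}^{n} \beta_j t_{p_j}, \qquad (6.4)$$

which requires the pseudoinverse for off-line processing. Write equation (6.4) in matrix form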

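$$\hat{t}_a = T \beta, \qquad (6.5)$$

where $\hat{t}_a$ is written as a column vector, T is a matrix whose jth column is $t_{p_j}$, and
$\beta$ is the column vector of the $\beta_j$. Solving for $\beta$ by linear least squares gives
$\beta = T^{\dagger} t_a$, where $T^{\dagger} = (T^t T)^{-1} T^t$ is the pseudoinverse of T. The pseudoinverse
can be computed off-line since it depends only on the example textures. Thus, run-time
vectorization performs texture analysis with $T^{\dagger}$ and reconstruction
with the columns of T (equation (6.5)). The figure for this approach shows example
images reconstructed by the pseudoinverse, where n was 40.

Note that for both basis sets, the linear coefficients are computed using a simple
projection operation. At run time, the only difference is whether one
subtracts off the mean image $t_{mean}$. In practice, though, the eigenimage approach
will require fewer projections since not all eigenimages are retained. Also, the
orthogonality of the eigenimages may produce a more stable set of linear coefficients.

Most of our vectorization experiments have been with the eigenimage basis, so the
notation in the next section uses this basis set.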
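For the eigenimage basis, the run-time texture step reduces to the projection of equation (6.2) followed by the reconstruction of equation (6.3). The sketch below assumes textures are flattened into plain lists and that the eigenimages are orthonormal; the function names are chosen here for illustration.

```python
def project(t_a, t_mean, eigenimages):
    """Equation (6.2): beta_i = e_i . (t_a - t_mean)."""
    centred = [a - m for a, m in zip(t_a, t_mean)]
    return [sum(e_k * c_k for e_k, c_k in zip(e, centred)) for e in eigenimages]


def reconstruct(betas, t_mean, eigenimages):
    """Equation (6.3): t_hat = t_mean + sum_i beta_i e_i."""
    t_hat = list(t_mean)
    for beta, e in zip(betas, eigenimages):
        for k, e_k in enumerate(e):
            t_hat[k] += beta * e_k
    return t_hat
```

Any part of the input texture lying outside the span of the retained eigenimages is lost in the reconstruction, which is why $\hat{t}_a$ only approximates $t_a$.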

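6.3.2 Run-time vectorization

In this section we go over the details of the vectorization procedure. The inputs
to the vectorizer are an image $i_a$ to vectorize and a texture model consisting of the
eigenimages $e_i$ and the mean image $t_{mean}$. In addition, the vectorizer takes a planar
transform P that selects the face region from the image and normalizes it for scale
and image-plane rotation. The transform P can be a rough estimate from a coarse
scale analysis. Since the faces in our test images were taken
against a solid background, face detection is relatively easy and can be handled
by correlating with a couple of face templates. The vectorization procedure refines the
estimate P, so the final outputs of the procedure are the vectorized shape $y^{std}_{a-std}$, a
set of $\beta_i$ coefficients for computing $\hat{t}_a$, and a refined estimate of P.

[Figure: convergence of the vectorization procedure as seen from texture space. Labels:
geometrically normalized textures; texture model $\sum_j \beta_j t_{p_j}$; (1) shape step;
(2) texture step. The iteration tries to make $t_a$ and $\hat{t}_a$ converge to the true $t_a$.]

As mentioned previously, the interconnectedness of the shape and texture steps
makes the iteration converge. The figure depicts the convergence of the vectorization
procedure from the perspective of texture. There are three sets of face images in the
figure, sets of (1) all face images, (2) geometrically normalized face textures, and (3)
the space of our texture model. The difference between the texture model space and
the set of geometrically normalized faces depends on the prototype set of n example
faces. The larger and more varied this set becomes, the smaller the difference
between sets (2) and (3). Here we assume that the texture model is not perfect, so
the true $t_a$ is slightly outside the texture model space.

The path for $t_a$, the geometrically normalized version of $i_a$, is shown by the curve
from $i_a$ to the final $t_a$. The path for $\hat{t}_a$ is shown by the curve from initial $\hat{t}_a$ to final
$\hat{t}_a$. The texture and shape steps are depicted by the arrows jumping between the
curves. The texture step, using the latest estimate of shape to produce $t_a$, projects
$t_a$ into the texture model space. The shape step uses the latest $\hat{t}_a$ to find a new set
of correspondences, thus updating shape and hence $t_a$. As one moves along the $t_a$
curve, one is getting better estimates of shape. As one moves along the $\hat{t}_a$ curve,
the $\beta_i$ coefficients in the texture model improve. Since the true $t_a$ lies outside the
texture model space, the iteration stops at final $\hat{t}_a$. This error can be made smaller
by increasing the number of prototypes for the texture model.
We now look at one iteration step in detail.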



One iteration 

Each iteration takes the current estimates of shape and texture coefficients from the
previous iteration. The first iteration starts from zero flow, $y^{std}_{a-std} = 0$; no initial
condition is needed for texture since the iteration starts with the texture step.

In the texture step, first the input image $i_a$ is geometrically normalized using
the shape estimate $y^{std}_{a-std}$, producing $t_a$

$$t_a(x) = i_a(x + y^{std}_{a-std}(x)), \qquad (6.8)$$

where x = (x, y) is a pixel location in the standard shape. This is implemented as a
backward warp using the flow vectors pointing from the standard shape to the input.
As mentioned in section 5.3.1, bilinear interpolation is used to sample $i_a$ at non-integral
(x, y) locations. Next, $t_a$ is projected onto the eigenimages $e_i$ using equation (6.2) to
update the linear coefficients $\beta_i$. These updated coefficients should enable the shape
computation to synthesize an approximation $\hat{t}_a$ that is closer to the true $t_a$.

In the shape step, first a reference image $\hat{t}_a$ is synthesized from the texture
coefficients using equation (6.3). Since the reference image reconstructs the texture of the
input, it should be well suited for finding shape correspondences. Next, optical flow is
computed between $\hat{t}_a$, which is geometrically normalized, and $i_a$, which updates the
pixelwise correspondences $y^{std}_{a-std}$. For optical flow, we used the gradient-based
hierarchical scheme of Bergen and Adelson [16], Bergen and Hingorani [18], and Bergen,
et al. [17]. The new correspondences should provide better geometrical normalization
in the next texture step.

Iterating these steps until the representation stabilizes is equivalent to
iteratively solving for the $y^{std}_{a-std}$ and $\beta_i$ which best satisfy

$$t_a = \hat{t}_a,$$

or

$$i_a(x + y^{std}_{a-std}(x)) = t_{mean} + \sum_{i=1}^{N} \beta_i e_i.$$

Adding a global transform 

We introduce a planar transform P to select the image region containing the face and
to normalize the face for the effects of scale and image-plane rotation. Let $i'_a$ be the
input image $i_a$ resampled under the planar transform P

$$i'_a(x) = i_a(P(x)). \qquad (6.9)$$

It is this resampled image $i'_a$ that will be geometrically normalized in the texture step
and used for optical flow in the shape step.

Besides selecting the face, the transform P will also be used for selecting subimages
around individual features such as the eyes, nose, and mouth. As will be explained
in the next section on our hierarchical implementation, the vectorization procedure
is applied in a coarse-to-fine strategy on a pyramid structure. Full face templates are
vectorized at the coarser scales and individual feature templates are vectorized at the
finer scales.

Transform P will be a similarity transform

$$P(x) = s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} x + \begin{pmatrix} t_x \\ t_y \end{pmatrix},$$

where the scale s, image-plane rotation $\theta$, and 2D translation $(t_x, t_y)$ are determined in
one of two ways.

1. Two point correspondence. Define anchor points $q_{std,1}$ and $q_{std,2}$ in standard
   shape, which can be done manually in off-line processing. Let $q_{a,1}$ and $q_{a,2}$ be
   estimates of the anchor point locations in the image $i_a$. The similarity transform
   parameters are then determined such that

   $$P(q_{std,1}) = q_{a,1}, \quad P(q_{std,2}) = q_{a,2}. \qquad (6.10)$$

   This uses the full flexibility of the similarity transform and is used when the
   image region being vectorized contains two reliable feature points such as the
   eyes.

2. Fixed s, $\theta$, and one point correspondence. In this case there is only one anchor
   point $q_{std,1}$, and one solves for $t_x$ and $t_y$ such that

   $$P(q_{std,1}) = q_{a,1}. \qquad (6.11)$$

   This is useful for vectorizing templates with less reliable features such as the
   nose and mouth. For these templates the eyes are vectorized first and used to
   fix the scale and rotation for the nose and mouth.

The vectorizer assumes that a face finder has provided an initial estimate for P.
This estimate is refined whenever the anchor points are misaligned, that is, whenever
there is nonzero flow at the anchor point

$$\| y^{std}_{a-std}(q_{std,i}) \| > \text{threshold}.$$

The correspondences can be used to update the anchor point estimate

$$q_{a,i} = P(q_{std,i} + y^{std}_{a-std}(q_{std,i})).$$

Next, P can be updated using the new anchor point locations using equation (6.10)
or (6.11), and $i_a$ can be resampled again using equation (6.9) to produce a new $i'_a$.
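Both ways of determining P have a closed form when 2D points are treated as complex numbers, since multiplying by $s e^{i\theta}$ scales and rotates. The sketch below is an illustration under that representation, not the code used in this work. It assumes the counterclockwise rotation convention of the matrix above; the function names are hypothetical.

```python
import cmath


def similarity_from_two_points(q_std1, q_std2, q_a1, q_a2):
    """Equation (6.10): solve P(q_std1) = q_a1 and P(q_std2) = q_a2."""
    zs1, zs2 = complex(*q_std1), complex(*q_std2)
    za1, za2 = complex(*q_a1), complex(*q_a2)
    c = (za2 - za1) / (zs2 - zs1)       # c = s * exp(i * theta)
    t = za1 - c * zs1
    return abs(c), cmath.phase(c), (t.real, t.imag)


def translation_from_one_point(s, theta, q_std1, q_a1):
    """Equation (6.11): s and theta are fixed, solve only for (t_x, t_y)."""
    t = complex(*q_a1) - cmath.rect(s, theta) * complex(*q_std1)
    return (t.real, t.imag)


def apply_similarity(s, theta, t, q):
    """Evaluate P(q) = s R(theta) q + t."""
    z = cmath.rect(s, theta) * complex(*q) + complex(*t)
    return (z.real, z.imag)
```

In the one-point case, s and $\theta$ would come from an earlier two-point solve, for example on the eyes.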

Entire procedure 

The entire procedure is summarized below, including the update of the similarity
transform P.

procedure vectorize

1. initialization

   (a) Obtain an initial estimate of P from a face finder.

   (b) Resample $i_a$ using the similarity transform P, producing $i'_a$ (equation
       (6.9)).

   (c) $y^{std}_{a-std} = 0$.

2. iteration: solve for $y^{std}_{a-std}$, $\beta_i$, and P by iterating the following steps until the
   $\beta_i$ stop changing.

   (a) Geometrically normalize $i'_a$ using $y^{std}_{a-std}$, producing $t_a$

       $$t_a(x) = i'_a(x + y^{std}_{a-std}(x)).$$

   (b) Project $t_a$ onto example set $e_i$, computing the linear coefficients $\beta_i$

       $$\beta_i = e_i \cdot (t_a - t_{mean}), \quad 1 \le i \le N.$$

   (c) Compute the reconstructed image $\hat{t}_a$

       $$\hat{t}_a = t_{mean} + \sum_{i=1}^{N} \beta_i e_i.$$

   (d) Compute the shape component using optical flow

       $$y^{std}_{a-std} = \text{optical-flow}(i'_a, \hat{t}_a).$$

   (e) If the anchor points are misaligned, as indicated by optical flow, then:

       i. Update P with new anchor points.

       ii. Resample $i_a$ using the similarity transform P, producing $i'_a$ (equation (6.9)).

       iii. $y^{std}_{a-std} = \text{optical-flow}(i'_a, \hat{t}_a)$.

Fig. 6-9 shows snapshot images of $i'_a$, $t_a$, and $\hat{t}_a$ during an example vectorization.
The starting input is shown in the upper left. We deliberately provided a poor initial
alignment for the iteration to demonstrate the procedure's ability to estimate the
similarity transform P.
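The control flow of steps 2(a) through 2(d) can be sketched as a small driver loop. This is a structural illustration only, not the original implementation: the warping and optical flow routines are passed in as callables (`warp` and `optical_flow` are hypothetical parameters), the conditional update of P in step 2(e) is omitted, and textures are flattened lists.

```python
def vectorize(image, t_mean, eigenimages, warp, optical_flow, zero_flow,
              max_iters=10, tol=1e-6):
    """Alternate texture and shape steps until the coefficients stop changing."""
    flow = zero_flow                                     # step 1(c): y = 0
    betas, t_hat = None, list(t_mean)
    for _ in range(max_iters):
        t_a = warp(image, flow)                          # 2(a) normalize geometry
        centred = [a - m for a, m in zip(t_a, t_mean)]
        new_betas = [sum(e_k * c_k for e_k, c_k in zip(e, centred))
                     for e in eigenimages]               # 2(b) project
        t_hat = list(t_mean)                             # 2(c) reconstruct
        for beta, e in zip(new_betas, eigenimages):
            for k, e_k in enumerate(e):
                t_hat[k] += beta * e_k
        flow = optical_flow(image, t_hat)                # 2(d) update shape
        converged = (betas is not None and
                     max(abs(a - b) for a, b in zip(new_betas, betas)) < tol)
        betas = new_betas
        if converged:
            break
    return flow, betas, t_hat
```

Passing the routines in as arguments keeps the loop independent of any particular optical flow or warping implementation.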

6.3.3 Pose dependence from the example set

Figure 6-9: Snapshot images of $i'_a$, $t_a$, and $\hat{t}_a$ during the three iterations of an example
vectorization. See text for details.

The example set, which is used to construct face space, makes the vectorizer
pose-dependent. The iteration step itself, however, is general for a range
of poses. Thus, even though we have chosen a frontal view as an example case, a
vectorizer tuned for a different pose can be constructed simply by using example
views from that pose.

specific vectorizers through interpolation. 



6.4 Hierarchical implementation 

For optimization purposes, the vectorization procedure is implemented using a
coarse-to-fine strategy. Given an input image to vectorize, first a Gaussian pyramid (Burt
and Adelson [33]) is computed to provide a multiresolution representation over 4
scales, the original image plus 3 reductions by 2. A face finder is then run over the
coarsest level to provide an initial estimate for the similarity transform P. Next, the
vectorizer is run at each pyramid level, working from the coarser to the finer levels,
with the correspondences from each level used to initialize the next.

6.4.1 Face finding at coarse resolution

Face finding is treated here as a separate problem from vectorization (see, for
example, Sung and Poggio [125], Sinha [122], and Moghaddam and Pentland [96]). For
our test images, we found that normalized correlation works well.
The normalized correlation metric is

$$r = \frac{\langle TI \rangle - \langle T \rangle \langle I \rangle}{\sigma(T)\, \sigma(I)},$$

where T is the template, I is the image region being matched against, $\langle \cdot \rangle$ is the mean
operator, and $\sigma(\cdot)$ is the standard deviation.

The face templates are averages of grey levels over populations of example faces;
the templates for a frontal pose are shown in Fig. 6-10. Only correlation matches
above a certain threshold are considered.
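The normalized correlation metric is easy to state in code. The sketch below is illustrative: it assumes the template and the image region have the same size and are flattened into lists, and it returns 0.0 when either has zero variance, a convention chosen here because the metric is undefined in that case.

```python
import math


def normalized_correlation(template, patch):
    """r = (<TI> - <T><I>) / (sigma(T) * sigma(I)) over equally sized lists."""
    n = len(template)
    mean_t = sum(template) / n
    mean_i = sum(patch) / n
    mean_ti = sum(a * b for a, b in zip(template, patch)) / n
    var_t = sum(a * a for a in template) / n - mean_t ** 2
    var_i = sum(b * b for b in patch) / n - mean_i ** 2
    if var_t <= 0.0 or var_i <= 0.0:
        return 0.0      # assumption: flat regions are treated as uncorrelated
    return (mean_ti - mean_t * mean_i) / math.sqrt(var_t * var_i)
```

The metric is invariant to gain and offset changes in either argument, which makes it tolerant of overall brightness and contrast differences between template and image.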
6.4.2 Multiple templates at high resolution 

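When processing the different pyramid levels, we use a whole-face template at the
two coarser resolutions and templates around the eyes, nose, and mouth for the
two finer resolutions. This template decomposition across scales is similar to Burt's
pattern tree approach for template matching on a pyramid representation. At
a coarse scale, faces are small, so full-face templates are needed to provide enough
spatial support for texture analysis.

With separate templates, the eyes use one set of linear texture coefficients and the
nose uses another; decoupling the eyes and nose provides extra flexibility in the
texture model. If [...] the inputs always [...], the extra flexibility is not required
and keeping to whole-face templates should be a help.

When vectorizing separate eyes, nose, and mouth templates at the finer two
resolutions, the template of the eyes is vectorized first and estimates a normalizing
similarity transform using two iris anchor points. The nose and mouth templates
then use only one anchor point.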

6 4.3 Example results 

• Fie 6 11 correspondences from the shape component are 
For the example case m F.g. b 11, cor p These segment features are 

plotted over the fonr levels of ^ Gauss, an py m^ Th g 

derated hy mapping at the highest 

To get a sense of the final sha pe/tex tu P F<jr fte ^ 

resolution, Fig. 6-12 displays the final output fa £ g ^ ^ 

nose and mouth templates, we show , the geometr, y^ No 

results presented in the next section on applications will provide a
more thorough analysis of the vectorizer.

6.5 Application to feature finding

texture coefficients can be used as a low-d,m [129][S|[1031. Our 

ti on, which is the familiar eigemmage appro£* face - g JJ^, .„ 
Figure 6-11 (panels: level 3 to level 0): line segment features which are generated by
mapping the averaged line segments from Fig. 6-2 to the input image.
Figure 6-12: Final vectorization at the original image resolution.

these correspondences to the problem of locating facial features. Chapter 7 discusses
how the shape correspondences can be used for synthesizing virtual views.

After vectorizing an input image $i_a$, pixelwise correspondence in the shape
component $y^{std}_{a-std}$ provides a dense mapping from the standard shape to the image.
Even though this dense mapping does more than locate just a sparse set of feature

correspondences and then mapping under the similarity transform $P$. For a
point $q_{std}$ in standard shape, its corresponding location in $i_a$ is

$$P(q_{std} + y^{std}_{a-std}(q_{std})).$$

For example, the line segment features of Fig. 6-2 can be mapped to the input by
mapping each endpoint, as shown for the test images in Fig. 6-13.
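
The endpoint mapping can be sketched as follows. This is only an illustration under stated assumptions: the dense shape is stored as an array of (dx, dy) offsets indexed by [row, column], and the similarity transform is parameterized by a scale, a rotation angle, and a translation. The function name and this parameterization are hypothetical and are not taken from the thesis.

```python
import numpy as np

def map_to_image(q_std, shape_flow, scale, angle_rad, translation):
    """Map a standard-shape point q to the image as P(q + y(q))."""
    x, y = q_std
    # Look up the dense correspondence at q (nearest pixel; flow is [row, col] -> (dx, dy)).
    dx, dy = shape_flow[int(round(y)), int(round(x))]
    moved = np.array([x + dx, y + dy])
    # Apply the similarity transform P: rotation, uniform scale, then translation.
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])
    return scale * rot @ moved + np.asarray(translation, dtype=float)

flow = np.zeros((8, 8, 2))
flow[4, 3] = (1.0, 2.0)                # correspondence stored at column 3, row 4
endpoint = map_to_image((3, 4), flow, scale=2.0, angle_rad=0.0, translation=(10.0, 0.0))
```

A line segment feature would be mapped by calling the function once per endpoint.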

Two vectorizers, one tuned for a frontal pose (view m3, see Fig. 1-3) and one for a slightly rotated
pose (view m4), were respectively tested on the m3 and m4 example views from our
database. This image set consists of 62 people, 2 views per person - a frontal and
slightly rotated pose - yielding a combined test set of 124 images. Example results
from the view m4 vectorizer are shown in Fig. 6-14. Because the same views
were used as example views to construct the vectorizers, a leave-6-out cross validation
procedure was used to generate statistics. That is, the original group of 62 images
from a given pose were divided into 11 randomly chosen groups (10 of 6 people, 1 of
the remaining 2 people). Each group of images is tested using a different vectorizer;
the vectorizer for group G is constructed from an example set consisting of the original
images minus the set G. This allows us to separate the people used as examples
from those in the test set.
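
The grouping used for leave-6-out cross validation can be sketched as follows. This is a minimal illustration, assuming people are identified by integer indices; the function name and the fixed random seed are assumptions made for the example.

```python
import random

def leave_k_out_groups(people, k=6, seed=0):
    """Randomly partition people into test groups of size k (the last group may be smaller)."""
    shuffled = list(people)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]

people = list(range(62))
groups = leave_k_out_groups(people)
# For each test group G, the example set is everyone not in G.
splits = [([p for p in people if p not in g], g) for g in groups]
```

With 62 people this yields 11 groups, 10 of 6 people and 1 of 2, and no test person ever appears in the example set used to build the vectorizer that tests him.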



                                     average distances
                                  point metric   edge metric
feature          detection rate   endpt. dist.   angle       perpend. dist.
                                  (pixels)       (degrees)   (pixels)

left eye         100% (124/124)   1.24
right eye        100% (124/124)   1.23
left eyebrow      97% (121/124)                  5.1°        1.06
right eyebrow     96% (119/124)                  4.8°        1.06
nose              99% (123/124)   1.45           3.2°        0.66
mouth             99% (123/124)                  2.2°        0.53

Table 6.1: Detection rates and average distances between computed and "ground 
truth" segments. Qualitatively, the eyebrow and nose errors were misalignments, 
while the mouth error did involve a complete miss. 

Qualitatively, the results were very good, with only one mouth feature being 
completely missed by the vectorizer (it was placed between the mouth and nose). To 
quantitatively evaluate the features, we compared the computed segment locations 
against manually located "ground truth" segments, the same segments used for off-line 
geometrical normalization. To report statistics by feature, the segments in Fig. 6-2
are grouped into 6 features: left eye ($c_3, c_4, c_5, c_6$), right eye ($c_9, c_{10}, c_{11}, c_{12}$), left
eyebrow ($c_1, c_2$), right eyebrow ($c_7, c_8$), nose ($n_1, n_2, n_3$), and mouth ($m_1, m_2$).

Two different metrics were used to evaluate how close a computed segment came 
to its corresponding ground truth segment. Segments in the more richly textured 
areas (e.g. eye segments) have local grey level structure at both endpoints, so we 
expect both endpoints to be accurately placed. Thus, the "point" metric measures 
the two distances between corresponding segment endpoints. On the other hand, 
some segments are more edge-like, such as eyebrows and mouths. For the "edge" 
metric we measure the angle between segments and the perpendicular distance from 
the midpoint of the ground truth segment to the computed segment. 

Next, the distances between the manual and computed segments were thresholded 
to evaluate the closeness of fit. A feature will be considered properly detected when 
all of its constituent segments are within threshold. Using a distance threshold of 
10% of the interocular distance and an angle threshold of 20°, we compute detection 
rates and average distances between manual and computed segments (Table 6.1). The 
eyebrow and nose errors are more of a misalignment of a couple points rather than a 
complete miss (the mouth error was a complete miss). 
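
The two metrics and the threshold test can be sketched as follows. This is an illustration under stated assumptions: a segment is a pair of (x, y) endpoints, the angle is measured between undirected lines, and the perpendicular distance is taken to the infinite line through the computed segment. These choices and all function names are assumptions made for the example.

```python
import numpy as np

def point_metric(computed, truth):
    """Distances between corresponding endpoints of two segments."""
    return [float(np.linalg.norm(np.subtract(c, t))) for c, t in zip(computed, truth)]

def edge_metric(computed, truth):
    """Angle in degrees between the segments, and perpendicular distance from the
    ground-truth midpoint to the line through the computed segment."""
    c0, c1 = np.asarray(computed, dtype=float)
    t0, t1 = np.asarray(truth, dtype=float)
    dc, dt = c1 - c0, t1 - t0
    cos = abs(dc @ dt) / (np.linalg.norm(dc) * np.linalg.norm(dt))
    angle = float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))
    normal = np.array([-dc[1], dc[0]]) / np.linalg.norm(dc)
    perp = float(abs((0.5 * (t0 + t1) - c0) @ normal))
    return angle, perp

def within_threshold(computed, truth, interocular, metric="point"):
    """Distance threshold of 10% of the interocular distance, angle threshold of 20 degrees."""
    max_dist = 0.10 * interocular
    if metric == "point":
        return all(d <= max_dist for d in point_metric(computed, truth))
    angle, perp = edge_metric(computed, truth)
    return angle <= 20.0 and perp <= max_dist

def feature_detected(segment_pairs, interocular, metric):
    """A feature counts as detected only when every constituent segment is within threshold."""
    return all(within_threshold(c, t, interocular, metric) for c, t in segment_pairs)
```

For instance, an eye feature would pass its four (computed, truth) segment pairs with `metric="point"`, while an eyebrow or mouth feature would use `metric="edge"`.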
6.6 Future work

In this section first we discuss some shorter-term work for the existing vectorizer 

6.6.1 Existing vectorizer 

It would be nice to demonstrate the vectorizer working in cluttered environments. To
accomplish this, both the face detection and vectorizer should be made more robust to
the presence of false positive matches. To improve face detection, we would probably

Pentland [96]. Both of these techniques model the space of grey level face images
using principal components analysis. To judge the "faceness" of an image



and the Mahalanobis distance 



l|t.-t.| 



where the A are the eigenspace projection coefficients and A, are the eigenvalues from 

aTT l°ZT nt r ]ysk - This distance metric cou]d be toT t w r 

as a threshold test after the iteration step has converged 

texture components at the finer level as well. P 
6.6.2 Parameterized shape model 

In the current vectorizer, shape is measured in a "data-driven" manner using optical 

where the shape of the ith example image, y£ M , h the 2D warping used to geometri 
^y»°™al,ze« outage in the off-line preparation step. This techninuefor", 
shape ,s sumlar to the work of Cootes, e, aL [44], Blake and Isard [21J B^umb g 

j c ™ [791 The new shape step would, given i' a and 

rrK rrj ml* - — » - — - ° f 

the approximation ~ 

i # B (x + n=i ^y;!- s td( x )) - ta - 

cedureionecanthinkofitasaparametenzed opt.cajflow ca. 

„ r te recognition ;„ the vectoriz er, the set of a shape coefficients 

representation for a face recognizer. 

6.6.3 Multiple poses 

The steward .a, to handie different ^^TX^, 
K we provide pixelwise correspond nee be W ee„ ^^M^. The 

^^^^^ 

r+ tr:: d ;r: ".Ives ns correspondence 

one for a combined textnral analys.s. This procedure ^ ^ t of 

x-ti's-:^- — - 

recognition. 

6.7 Summary 
component can be used to ?Pnm f ■ n , ' tUral Malysis > the sha Pe

analysis can be used to rW„ » f coefficients from the textura] 

«u shaP e e :: e ; r : 1:1 s: — r g the ; " put - we th - 

standard shape, and the inont The A T * which is at 

ra, feedback he ween ^ZL^T ^T^*^" »»- 
between the two nnti. he shapX t COmPUta "° nS b * <«* -d forth 

strated an effi c ; ent J2l^ t T n > n "**>°' We have demon- 

strategy. ™P W »'*"°» of the vectorizer nsing , hierarchica] coarse . to . foe 

in 124 test i^J^^T^™*'" - 

was missed h y S ™ L In the , , ^ ^ ^ ° M ™» U ' 



Chapter 7 

Face recognition using one view 

of pose-invariant face recogmfon: (,) is P available . Chapter 4 discussed 

person, and (ii) only one example v,ewpe « to ^ problem . In this 

LfirsUa^,inwhicha S , m plev,e»-b^dappro^ jllst a ^ Vicense 

chapter weexamine the second case, where for If we wish to recognize new 

photograph is available for each pe^on ,n £ * ^ rf ^ ^ ., mages 

7.1 Introduction

change, features such as color or geometn £ J Ota — 

.cognition, this approach has " ^ „ y Hems using color (Swain and 
tion (Sinha [122)) and for mdexing of packaged g 

Ballard [126)). Malsburs and collaborators [91)183], 

In the flexible matching approach (von der M^sburg ^ 
a ,so see Chapter 2), the input unage ,s fctamed ™ ^ ^ ^ ^ s0 that 
In 191), the deformation .s dnven by a matcn,n 8 ^ 2D warp rat her than a 

'JriUing transformation between ^^^>»^--»-"' , r 
global, rigid transform. Th,s enables *^ rf orma Jotations . A deformation 
even though they may difler ,„ express onor ou t of pi ^ 
matchingthe input with a ™ d ?' "^ ^t Jometrical distortion induced by 
both the similarity of ^ ^^^Lie (a) constructing a generally 
0*15 ~: 0 ::zte r ior t r * ° f th * -

That is, given a single vL of an I I, n 7 ? views, 
additional views of 0 from other v^ vfew^ 7* k^ ° f "* **« **• 
knowledge. As mentioned in secZs Cd ^ T" 

prior knowledge is a generic 3D model of the h,™ f "T* Anting this 
the appearance of a face nnder diffe L„, „ *?' ^ ™ be ™ A «• P-dict 

a 2D face image is texture j£ f£ 5^7^^ °»* 

-d tr - r s in - 

%h'in g s can he modeled byslplin g t £2 ^ difeent 

Given one view of a person a„d"he proto v ^ ™" U " der ' W "^"^ 

a ^ntt - ^ * ^tia, for heing 

examp,e-based approach to bypass 3D Id! 1 3D b t " 
explored in the linear combinations aDmTkT J recc « nitio " "as first 

Poggio (105)). m iinear comhin" oTs ™ f T^" (U,,ma " Md B ™ W 
under rigid 3D transformation ^ 2" tZ ^ ^ " " 
of 2D example views, where the 2D vie conization of a small set 

y- This is valid for 'a £ o "j^"" " «=- 
visible in all views and thus car, be hZl int * """"f ** P<>i " tS « 

™» - - *-=S f :;:^s 

-^^^f^ ~h in the 

virtual view, Normally, >vith jnTt onT^e W 3D " "t'' '"^nndworh f„ r 

any method for generating additioXb Zl v ^u™ " "* P ° SSibfe ' H °" e ver, 
to nse the the linear comhinati ^ Thism -T^p ^i'™ 




example images of prototypal see sectio „ L2), and the Utter 

reSection of the single example can be genera ted I ^ ^ ^ ^ 

, ead s to the idea of linear classes, wh.c h we .< xp * ^ 

After ^enssingmethodsfo^^ = tM ^ ^ 

in the view-based, ^ view availabl e for each person 

example view m4 in F,g. 7-2 s the o y ^ get of 

we will synthesize the remammg set of ™ d J' yiews in the view -based 

„„e real and mnltiple virtnal vews wd! bensed a, ex J ^ ^ rf 

ima L^- - onr work, Lando and W-M £ — — "_ 
tbe same overall question - general, atmn ^m a n k rf ^ 

using a similar example-based techn,, » fa eP-« IV & fa 

In addition, Manrer and von der Malsbn g pi . g more 

normals. 

7.2 Prior knowledge of object class: prototype views

In our ex: 



„ ,, r example-hased approach for 

transformations snch a, changes ,» rotafon o express^ P ^ ^ 

of prototypical faces. Let there ^^^ ^^-6-. Unlike 
prototypes are chosen ^ v« - vi« « 



non-prototype f 



available for each prototype n . ^ ^ to trans form 

Given a single real view of a novel face at a know P , ^ ^ ^ 

the face to prodnce a rotated v.rtua. v*.. C d f the know ^ ^ rf ^ 

in Fig. 7-1, let 

$i_{p_j}$ = set of $N$ prototype views at standard pose,
$i_{p_j,r}$ = set of $N$ prototype views at virtual pose,




- * - vectorized 

W have been vectorized, producing W vt , "" , Pr< "° tyPe ^ * » d 
V, The specific J™ « ~" f ^ *»' ^ '«""» «*» 

section 7.4. 9 ed '° vector '^ ™ages will be discussed in 

set of prototype views contain a ^ JtTo'T T" *" *"» < he 

of the vectorized representation ilfe th a cor f °" de&iti °" 

across different viewpoints as Tn" '° be COm " uted 

for generating virtual views pZalld V"" , J*""' ' W ° 

requirenreuts in « erms of co respo ■ ^ W difc «" 

quires these correspondences so 1! ! 7 ■ KWPO "' i - ^ defo ™atio„ re- 
On the other hand'inea 7£L Z TT ™ "* " °" e ^ 

each viewpoint. In this latter c J veZZ ion T ^ fe 

across the different prototypes at a ' Le "ose S ' mP ' ha " d,ing C0 ™ P °" d »- 
7.3 Virtual views synthesis techniques

Let $i_n$ be the single real view of the novel face in standard pose. In this section, we
discuss how to synthesize the virtual view $i_{n,r}$ using the techniques of linear classes
and parallel deformation.

7.3.1 Linear Classes 

Because the theory of linear classes begins with a modeling assumption in 3D, let 
us generalize the 2D vectorized image representation to a 3D object representation. 
Recall that the 2D image vectorization is based on establishing feature correspondence 
across a set of 2D images. In 3D, this simply becomes finding a set of corresponding 
3D points for a set of objects. The feature points are distributed over the face in 3D 
and thus may not all be visible from any one single view. Two measurements are 
made at each 3D feature point: 

1. Shape. The (x,y,z) coordinates of the feature point. If there are n feature 
points, the vector Y will be a vector of length 3n consisting of the x, y, and z 
coordinate values. 

2. Texture. If we assume that the 3D object is Lambertian and fix the lighting
direction $l = (l_x, l_y, l_z)$, we can measure the intensity of light reflected from each
feature point, independent of viewpoint. At the $i$th feature point, the intensity
$T[i]$ is given by

$$T[i] = \rho[i]\,(n[i] \cdot l), \qquad (7.1)$$

where $\rho[i]$ is the albedo, or local surface reflectance, of feature $i$ and $n[i]$ is the
local surface normal at feature $i$.$^1$

The texture vector $T$ is not an image; one can think of it as a texture that is mapped
onto the 3D shape $Y$ given a particular set of lighting conditions $l$. One helpful way
to visualize the texture vector $T$ is as a sampling of image intensities in a cylindrical
coordinate system that covers feature points over the entire face. This is similar to
the sampling produced by the Cyberware scanner.
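
Equation (7.1) can be evaluated for all feature points at once. The following is a small sketch, assuming unit surface normals stored one per row; the function name and the sample values are made up for illustration. It follows the equation literally and does not clamp points facing away from the light.

```python
import numpy as np

def lambertian_texture(albedo, normals, light):
    """Texture vector T with T[i] = albedo[i] * (normals[i] . light)."""
    albedo = np.asarray(albedo, dtype=float)      # shape (n,)
    normals = np.asarray(normals, dtype=float)    # shape (n, 3), unit normals
    light = np.asarray(light, dtype=float)        # shape (3,), fixed lighting direction
    return albedo * (normals @ light)

# Two feature points lit from the front: one faces the light, one is side-on.
T = lambertian_texture([0.5, 0.8], [[0, 0, 1], [1, 0, 0]], [0, 0, 1])
```

Note that the viewing direction does not appear anywhere in the computation, which is what makes $T$ independent of viewpoint.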

Consider the relationship between 3D vectorized shape Y and texture T and their
counterpart 2D versions y and t. The projection process of going from 3D shape Y to 

$^1$Actually, it is not strictly necessary for the object to be Lambertian; equation (7.1) could be a
different functional form of $\rho$, $n$, and $l$. What is necessary is that $T[i]$ is independent of lighting and
viewing direction, which may be achieved by fixing the light source and assuming that the object is
Lambertian.
2D shape y consists of a 3D rotat i™ « i •

- orthotic projectio „. aST^ - »T points, 

$y$, we model this using a matrix $L$

$$y = LY, \qquad (7.2)$$



^Crlll 1 a ^T" ProjeCti0 " ° Pera ' 0r ' 

™ble in the standard shape at viewpoint" " ' ^ "* ^ poi " ts «-* » 
$$t = DT,$$

Like operator $L$, $D$ is a linear operator.

$$Y = \sum_{j=1}^{N} \alpha_j Y_{p_j} \quad \text{and} \quad T = \sum_{j=1}^{N} \beta_j T_{p_j} \qquad (7.4)$$

for some set of $\alpha_j$ and $\beta_j$ coefficients.

vew Vr through the 3D wclori ^™' *> the destination virtual 

of ■„ at standard P „ se estimates the " 1m 1^ ^' rS '' * 2D ™* analysis 
views i p . Then the virtual view 7 " be 7 , (U) * ' he l***H>e 
and the prototype views i p r Let us now e "™ g ^ «dBdZ. 

and texture of the novel fad ° W eXam ' n<! st ^ - detail for the shape 

Virtual shape 
In linear classes, we assume that the novel 3D shape $Y_n$ can be written as a linear
combination of the prototype shapes $Y_{p_j}$

$$Y_n = \sum_{j=1}^{N} \alpha_j Y_{p_j}. \qquad (7.5)$$

If the linear class assumption holds and the set of 2D views $y_{p_j}$ are linearly independent,
then we can solve for the $\alpha_j$'s at the standard view

$$y_n = \sum_{j=1}^{N} \alpha_j y_{p_j} \qquad (7.6)$$

and use the prototype coefficients $\alpha_j$ to synthesize the virtual shape

$$y_{n,r} = \sum_{j=1}^{N} \alpha_j y_{p_j,r}. \qquad (7.7)$$

This is true under orthographic projection. The mathematical details are provided
in Appendix B.

While this may seem to imply that we can perform a 3D analysis based on one 
2D view of an object, the linear class assumption cannot be verified using 2D views. 
Thus, from just the 2D analysis, the technique can be "fooled" into thinking that 
it has found a good set of linear coefficients when in fact equation (7.5) is poorly 
approximated. That is, the technique will be fooled when the actual 3D shape of the 
novel person is different from the 3D interpolated prototype shape in the right hand 
side of equation (7.5). 

In solving equations (7.6) and (7.7), the linear class approach can be interpreted
as creating a direct mapping from standard to virtual pose. That is, we can derive a
function that maps from $y$'s in standard pose to $y$'s in the virtual pose. Let $Y$ be a
matrix where column $j$ is $y_{p_j}$, and let $Y_r$ be a matrix where column $j$ is $y_{p_j,r}$. Then if
we solve equation (7.6) using linear least squares and plug the resulting $\alpha$'s into
equation (7.7), then

$$y_{n,r} = Y_r Y^{\dagger} y_n, \qquad (7.8)$$

where $Y^{\dagger}$ is the pseudoinverse $(Y^t Y)^{-1} Y^t$.
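
The direct mapping of equation (7.8) can be checked on synthetic data. The sketch below assumes random 3D point sets as stand-ins for prototype faces, a rotation about the vertical axis, and orthographic projection with all points visible; these choices and the helper names are assumptions made for illustration. The novel shape is built to satisfy the linear class assumption (7.5) exactly, so the synthesized virtual shape should match the true projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_protos = 20, 5

def rot_y(deg):
    """3D rotation about the vertical (y) axis."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(shape3d, R):
    """Rotate an (n, 3) shape and project orthographically to a 2n-vector of (x, y)."""
    return (shape3d @ R.T)[:, :2].ravel()

protos = rng.normal(size=(n_protos, n_points, 3))        # hypothetical prototype 3D shapes
alpha = rng.normal(size=n_protos)
novel = np.tensordot(alpha, protos, axes=1)              # novel shape obeys (7.5) exactly

R_std, R_virt = rot_y(0.0), rot_y(20.0)
Y = np.stack([project(p, R_std) for p in protos], axis=1)     # columns y_{p_j}
Y_r = np.stack([project(p, R_virt) for p in protos], axis=1)  # columns y_{p_j,r}

y_n = project(novel, R_std)
y_nr = Y_r @ np.linalg.pinv(Y) @ y_n                     # equation (7.8)
y_true = project(novel, R_virt)
```

Here `np.linalg.pinv(Y) @ y_n` recovers the coefficients `alpha` from the single standard view, and no 3D information about the novel shape is used in the synthesis.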

Another way to formulate the solution as a direct mapping is to train a network to
learn the association between standard and virtual pose (see Poggio and Vetter [110]).
The (input, output) pairs presented to the network during training would be the
prototype pairs $(y_{p_j}, y_{p_j,r})$. A potential architecture for such a network is suggested
by the fact that equation (7.8) can be implemented by a single layer linear network.
The weights between the input and output layers are given simply by the matrix
$Y_r Y^{\dagger}$.
Virtual texture

fr t „ md tt . prolotype tex :r; f : 4 ";r< • A ve ; t text " r \ of rr- 

classes can be used to synthesize the v rtual tex tu eT Th ° f 
texture is theu warped or texture mancvl 1 " s J"" hes «cd grey level 
virtual view. The ideas preseutedTuT r """^ ^ '° ^ * faisl >^ 
*o independent* hy vJ^XS^T "~ ""*»"' ^ *. author and 

To generate the virtual textnrp t 
approximation at the standi 7»Z ° P T , S '" g ^ cUs id « ° f 
to the shape case, this relief hi 7 7 , " * V '' eW ' S™ 1 ^ 

T is linearly span ed by a se of w , ^ ° f g ^ '«* ^ 

is borne out by recent succe M 7 ^ °' ' his 

combination of the prototype textures t" " ^ Wn " en M a 

$$T_n = \sum_{j=1}^{N} \beta_j T_{p_j}.$$

$$t_n = \sum_{j=1}^{N} \beta_j t_{p_j} \qquad (7.10)$$

and use the same set of coefficients to reconstruct the texture of the virtual view 

light source is kept fixed J he ^tween object and 

section 7.5 for recognition exoerinW T , GXample images and 

^ the same ^J^J^^^ - W. »e can 

7.3.2 Parallel deformation 
then the linear class idea can be interpreted as finding a 2D deformation from $y_n$ to
$y_{n,r}$. Having shape vectors in cross view correspondence simply means that the $y$
vectors in both poses refer to the same set of facial feature points. The advantage
of computing this 2D deformation is that the texture of the virtual view can be
generated by texture mapping directly from the original view $i_n$. This avoids the
need for additional techniques to synthesize virtual texture at the virtual view.

To see the deformation interpretation, subtract equation (7.6) from (7.7) and move
$y_n$ to the other side, yielding

$$y_{n,r} = y_n + \sum_{j=1}^{N} \alpha_j (y_{p_j,r} - y_{p_j}). \qquad (7.12)$$

Bringing shape vectors from the different poses together in the same equation is legal
because we have added cross view correspondence. The quantity $\Delta y_{p_j} = y_{p_j,r} - y_{p_j}$ is
a 2D warp that specifies how prototype $j$'s feature points move under the prototype
transformation. Equation (7.12) modifies the shape $y_n$ by a linear combination of
these prototype deformations. The coefficients of this linear combination, the $\alpha_j$'s,
are given by $Y^{\dagger} y_n$, the solution to the approximation equation (7.6).

Consider as a special case the deformation approach with just one prototype. In
this case, the novel face is deformed in a manner that imitates the deformation seen
in the prototype. This is similar to performance-driven animation (Williams [136]),
and Poggio and Brunelli [108], who call it parallel deformation, have suggested it as
a computer graphics tool for animating objects when provided with just one view.
Specializing equation (7.12) gives

$$y_{n,r} = y_n + (y_{p,r} - y_p), \qquad (7.13)$$

where we have dropped the $j$ subscripts on the prototype variable $p$. The deformation
$\Delta y_p = y_{p,r} - y_p$ essentially represents the prototype transform and is the same 2D
warping as in the multiple prototypes case.

By looking at the one prototype case through specializing the original equations
(7.6) and (7.7), we get $y_n = y_p$ and $y_{n,r} = y_{p,r}$. This seems to say that the virtual
shape $y_{n,r}$ is simply that of the prototype at virtual pose, so why should equation
(7.13) give us anything different? However, the specialized equations, which approximate
the novel shape by prototype shape, are likely to be poor approximations. Thus,
we should really add error terms, writing $y_n = y_p + y_{error_1}$ and $y_{n,r} = y_{p,r} + y_{error_2}$.
The error terms are likely to be highly correlated, so by subtracting the equations -
as is done by parallel deformation - we cancel out the error terms to some degree.
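
The partial cancellation of the error terms can be illustrated numerically. The sketch below assumes a random 3D point set as a stand-in prototype, a novel shape that is a perturbed copy of it, a 20 degree rotation about the vertical axis, and orthographic projection; all of these are assumptions made for illustration. It compares using the prototype's virtual shape directly against equation (7.13).

```python
import numpy as np

rng = np.random.default_rng(1)

def rot_y(deg):
    """3D rotation about the vertical (y) axis."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(shape3d, R):
    """Rotate an (n, 3) shape and project orthographically to a 2n-vector of (x, y)."""
    return (shape3d @ R.T)[:, :2].ravel()

proto = rng.normal(size=(20, 3))                   # hypothetical prototype 3D shape
novel = proto + 0.2 * rng.normal(size=(20, 3))     # novel face: similar but not identical

R_std, R_virt = rot_y(0.0), rot_y(20.0)
y_p, y_pr = project(proto, R_std), project(proto, R_virt)
y_n, y_true = project(novel, R_std), project(novel, R_virt)

y_direct = y_pr                    # one-prototype specialization of (7.6) and (7.7)
y_parallel = y_n + (y_pr - y_p)    # parallel deformation, equation (7.13)

err_direct = np.linalg.norm(y_true - y_direct)
err_parallel = np.linalg.norm(y_true - y_parallel)
```

In this setup the direct estimate carries the full shape difference between the two faces, while the parallel deformation estimate only carries the part of that difference that changes under the rotation, so `err_parallel` comes out smaller.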
7.3.3 Comparing linear classes and parallel deformation

What are some of the relative advantages of linear classes and parallel deformation?
First, consider some of the advantages of linear classes over parallel deformation.
Parallel deformation works well when the 3D shape of the prototype matches the 3D
shape of the novel face. If the two 3D shapes differ enough, the virtual view generated
by parallel deformation will be geometrically distorted. Linear classes, on
the other hand, effectively tries to construct a prototype that matches the novel shape
by taking the proper linear combination of example prototypes. Another advantage
of linear classes is that correspondence is not required between standard and virtual
poses. Thus, linear classes may be able to cover a wider range of rotations out of the
image plane as compared to parallel deformation.

One advantage of parallel deformation over linear classes is its ability to preserve
peculiarities of texture such as moles or birthmarks. Parallel deformation will preserve
such marks since it samples texture from the original real view of the novel person's
face. For linear classes, it is most likely that a random mark on a person's face will
be outside the linear texture space of the prototypes, so it will not be reconstructed
in the virtual view.
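
The loss of an isolated mark under linear classes can be illustrated with a least squares projection onto a small prototype texture space. The sketch below uses random vectors as stand-in prototype textures and adds a single bright "mole" pixel to a novel texture; the sizes, the pixel index, and the values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pixels, n_protos = 100, 5

T = rng.normal(size=(n_pixels, n_protos))    # columns: hypothetical prototype textures
beta_true = rng.normal(size=n_protos)

mole = np.zeros(n_pixels)
mole[37] = 10.0                              # isolated mark, not present in any prototype
t_n = T @ beta_true + mole                   # novel texture = prototype mix + mole

beta, *_ = np.linalg.lstsq(T, t_n, rcond=None)
t_hat = T @ beta                             # reconstruction within the prototype space

# How much of the mole's intensity survives at its own pixel.
recovered_mole = (t_hat - T @ beta_true)[37]
```

Only the component of the mole that happens to lie in the span of the prototypes is reconstructed, so most of its intensity is lost, whereas a method that resamples the original view keeps the mark.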



7.4 Generating virtual views 

In our approach to recognizing faces using just one example view per person, we first
expand the example set by generating virtual views of each person's face. The full set
of views that we would ideally like to have for our face recognizer is
the set of 15 example views shown in Fig. 7-2 and originally used in the recognizer
from Chapter 4. These views evenly sample the two rotation angles out of the image
plane.
While Fig. 7-2 shows 15 real views, in virtual views we assume that only view
m4 is available and we synthesize the remaining 14 example views. For the single
real view, an off-center view was favored over, say, a frontal view because of the
recognition results for bilaterally symmetric objects of Poggio and Vetter [110]. When
the single real view is from a nondegenerate pose (i.e. mirror reflection is not equal
to the original view), the mirror reflection provides an additional view that
can be used for recognition. The choice of an off-center view is also supported by
the psychophysical experiments of Schyns and Bülthoff [116]. They found that when

better when the single training view is an off-center one as opposed to a frontal pose.
In completing the set of 15 example views, the 8 views neighboring m4 will be





Figure 7-2: The view-based face recognizer uses 15 views to model a person's face. 
For virtual views, we assume that only one real view, view m4, is available and we 
synthesize the remaining 14. 

generated using our virtual views techniques. Using the terminology of the theory 
section, view m4 is the standard pose and each of the neighboring views are virtual 
poses. The remaining 6 views, the right two columns of Fig. 7-2, will be generated by 
assuming bilateral symmetry of the face and taking the mirror reflection of the left 
two columns. 
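
The mirror reflection step can be sketched as follows, assuming a view is a 2D grey level array and feature points are stored as (x, y) pixel coordinates. The function names are hypothetical. Note that after reflection the left and right feature labels (e.g. left eye and right eye) would also need to be swapped, which this sketch does not do.

```python
import numpy as np

def mirror_view(image):
    """Reflect a view about the vertical axis, as implied by bilateral symmetry."""
    return np.fliplr(image)

def mirror_points(points, width):
    """Reflect (x, y) feature points of an image that is `width` pixels wide."""
    pts = np.asarray(points, dtype=float).copy()
    pts[:, 0] = (width - 1) - pts[:, 0]
    return pts

view = np.array([[1, 2, 3],
                 [4, 5, 6]])
reflected = mirror_view(view)
reflected_points = mirror_points([[0, 1]], width=3)
```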

We now describe how parallel deformation and linear classes were used to expand 
the example set with virtual views. Recognition results with these virtual views are 
summarized in the next section. 



7.4.1 Parallel deformation 

The goal of parallel deformation is to map a facial transformation observed on a
prototype face onto a novel, non-prototype face. There are three steps in implementing
parallel deformation: (a) recording the deformation $y_{p,r} - y_p$ on the prototype
face, (b) mapping this deformation onto the novel face, and (c) 2D warping the novel
face using the deformation. We now go over these steps in more detail, using as an
example the prototype views and single novel view in Fig. 7-3.

Figure 7-3: In parallel deformation, (A) the prototype flow $y^{p}_{p,r-p}$ is first measured
between $i_{p,r}$ and $i_p$, (B) the flow is mapped onto the novel face $i_n$, and (C) the novel
face is 2D warped to the virtual view.

First, we collect prototype views $i_p$ and $i_{p,r}$ and compute the prototype deformation

$$y^{p}_{p,r-p} = \mathrm{vect}(i_{p,r}, i_p)$$

tht nS 2Dd ^ "T' Sh ° Wn c ° Ver ' ayed " ' he « «- •* of Fig- 7-3 

knowledge of face rotation. To assist the correspondence calculation, a sequence of

oticaTT ^ ° VirtUal ^ ^ USed inS '- d ° f *-» & 

last W ' ^ C ° mPUted C ° nCatena ' ed '° «* "» te «~ fc» to 

Next, the 2D deformation is mapped onto the novel person's face by
changing the reference frame of $y^{p}_{p,r-p}$ from $i_p$ to $i_n$. First, interperson correspondences
between $i_p$ and $i_n$ are computed

$$y^{p}_{n-p} = \mathrm{vect}(i_n, i_p)$$

and used to change the reference frame

$$y^{n}_{p,r-p} = \text{fwarp-vect}(y^{p}_{p,r-p}, y^{p}_{n-p}).$$

dard view. As the interperson correspondences are difficult to compute, we evaluated
proto A proto B proto C



Figure 7-4: The prototypes used for parallel deformation. Standard poses are shown. 

two techniques for establishing feature correspondence: labeling features manually on 
both faces, and using the vectorizer from Chapter 6 to automatically locate features. 
More will be said about these two approaches shortly. 

Finally, the texture from the original real view $i_n$ is 2D warped onto the rotated
face shape, producing the final virtual view

$$i_{n,r} = i_{n+(p,r-p)} = \mathrm{fwarp}(i_n, y^{n}_{p,r-p}).$$

Referring to our running example in Fig. 7-3, the final virtual view is shown in the 
lower right. 
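
A forward warp of this kind can be sketched as follows. This is a deliberately simplified illustration, not the fwarp operator used in the thesis: it assumes a flow array of (dx, dy) offsets indexed by [row, column], rounds destinations to the nearest pixel, and leaves any holes at zero rather than filling them.

```python
import numpy as np

def forward_warp(image, flow):
    """Push each source pixel to (x + dx, y + dy); unfilled destination pixels stay 0."""
    h, w = image.shape
    out = np.zeros_like(image)
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.rint(xs + flow[..., 0]).astype(int)
    yd = np.rint(ys + flow[..., 1]).astype(int)
    inside = (xd >= 0) & (xd < w) & (yd >= 0) & (yd < h)   # drop pixels pushed off the image
    out[yd[inside], xd[inside]] = image[inside]
    return out

image = np.arange(12.0).reshape(3, 4)
shift_right = np.zeros((3, 4, 2))
shift_right[..., 0] = 1.0              # move every pixel one column to the right
warped = forward_warp(image, shift_right)
```

A practical implementation would also need to fill holes and resolve collisions where several source pixels land on the same destination.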

In this procedure for parallel deformation, there are two main parameters that 
one may vary: 

1. The prototype. As mentioned previously, the accuracy of virtual views generated
by parallel deformation depends on the degree to which the 3D shape of the
prototype matches the 3D shape of the novel face. Thus, one would expect
different recognition results from different prototypes. We have experimented
with virtual views generated using the three different prototypes shown in Fig. 7-4.
In general, given a particular novel person, it is best to have a variety of
prototypes to choose from and to try to select the one that is closest to the
novel person in terms of shape.

2. Approach for interperson correspondence. In both the manual and automatic 
approaches, interperson correspondences are driven by the line segment features 
shown in Fig. 7-5. The automatic segments shown on the right were located 
using our face vectorizer from Chapter 6. The manual segments on the left in- 
clude some additional features not returned by the vectorizer, especially around 
the sides of the face. Given these sets of correspondences, the interpolation 
method from Beier and Neely [13] (see section 5.2.1) is used to interpolate the 
correspondences to define a dense, pixelwise mapping from the prototype to 
Figure 7-5: Parallel deformation requires correspondences between the prototype and
novel face. These correspondences are driven by the line segment features shown in this
figure. The features on the left were manually located and the features on the right
were automatically located using the vectorizer.

novel face. For the automatic case, we did try to use the dense y shape vectors 
directly to avoid the Beier and Neely interpolant. However, the LorLTm

are defined only in those regions. This, in turn, made the optical flow
correspondence step in the recognizer itself more difficult. The Beier and Neely
interpolation method provides a simple way to extrapolate the correspondences
defined over the center part of the face to the face periphery.

Figures 7-6 and 7-7 show example virtual views generated using prototype A with
the real view in the center

prototypes, Fig. 7-8 shows virtual views generated from all three prototypes. For
comparison purposes, the real view of each novel person is shown on the right.

7.4.2 Linear Classes 

aUhTsttdarT- T "* ^ te ™ S » f «» P~Wyp- 

standard T rcC ° nStn,Ct * ""^ ™ W - ,n the —W. step atThe 

standard view, we decompose the shape free texture of the novel view $t_n$ in terms of
the $N$ shape free prototype views $t_{p_j}$

$$t_n = \sum_{j=1}^{N} \beta_j t_{p_j}. \qquad (7.14)$$

fortteT 'thJnolT T" COeffidentS ' S0 '™« ' his 

the $\beta_j$, the novel view $i_n$ and prototype views $i_{p_j}$ must be vectorized to produce
Figure 7-6: Example virtual views using parallel deformation. Prototype A was used,
and interperson correspondence $y^{p}_{n-p}$ was specified manually.

Figure 7-7: Example virtual views using parallel deformation. Prototype A was used,
and interperson correspondence $y^{p}_{n-p}$ was computed automatically using the image
vectorizer.
Figure 7-8: Example virtual views as the prototype person is varied. The corresponding
real view of each novel person is shown on the right for comparison.

the geometrically normalized textures $t_n$ and $t_{p_j}$, $1 \le j \le N$. Since the

be put into correspondence manually in an off-line step (using the Beier and Neely

this standard shape be denoted as $y_{std}$.

Our image vectorizer, which we describe in Chapter 6, is used to solve for the

Z7::tZ Y r d betWGen ^ Standard y~ Th - ™t onlfe 

can then be used to geometrically standardize 

$$t_n(x) = i_n(x + y^{std}_{n-std}(x)),$$

where $x$ is an arbitrary 2D point $(x, y)$ in standard shape. Fig. 7-9 on the left shows

13of Tn ^ *" -tortf r T L 

right side of the figure shows templates t n of the eves nose and o.u 
been geometrically normalized using the correspondent ^ / ^ 
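
Geometric normalization of this form is a backward warp: each pixel of the shape free texture is read from the input image at its own location plus its correspondence. The following is a minimal sketch, assuming a flow array of (dx, dy) offsets defined over the standard shape, nearest-pixel sampling, and clamping at the image border; these details and the function name are assumptions made for illustration.

```python
import numpy as np

def backward_warp(image, flow):
    """Shape free texture t(x) = image(x + flow(x)), sampled at the nearest pixel."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, image.shape[1] - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, image.shape[0] - 1)
    return image[src_y, src_x]

image = np.arange(12.0).reshape(3, 4)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0                     # each standard-shape pixel reads from one column to its right
texture = backward_warp(image, flow)
```

Unlike a forward warp, every output pixel receives a value, so no hole filling is needed; a smoother result would use bilinear rather than nearest-pixel sampling.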

Next, the texture $t_n$ is decomposed as a linear combination of the prototype
textures following equation (7.14). First, combine the $\beta_j$ terms into a column vector
$\beta$ and define a matrix $T$ of the prototype textures, where the $j$th column of $T$ is $t_{p_j}$.
Figure 7-9: Using correspondences from our face vectorizer, we can geometrically
normalize input $i_n$, producing the "shape free" texture $t_n$.

Then equation (7.14) can be rewritten as

$$t_n = T\beta.$$

This can be solved using linear least squares, yielding

$$\beta = T^{\dagger} t_n,$$

where $T^{\dagger}$ is the pseudoinverse $(T^T T)^{-1} T^T$.

The synthesis step assumes that the textural decomposition at the virtual view is
the same as that at the standard view. Thus, we can synthesize the virtual texture

$$t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r},$$

where the $t_{p_j,r}$ are the shape free prototypes that have been warped to the standard shape
of the virtual view. As with the $t_{p_j}$'s, the $t_{p_j,r}$'s are put into correspondence manually
in an off-line step. If we define a matrix $T_r$ such that column $j$ is $t_{p_j,r}$, the analysis
and synthesis steps can be written as a linear mapping from $t_n$ to $t_{n,r}$

$$t_{n,r} = T_r T^{\dagger} t_n.$$

This linear mapping was previously discussed in section 7.3.1 for generating virtual 
shapes. 
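
The analysis and synthesis steps for texture reduce to one least-squares solve followed by one matrix product. The sketch below illustrates this on random stand-in data; the array sizes, variable names, and the use of `numpy.linalg.lstsq` are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def synthesize_virtual_texture(t_n, T, T_r):
    """Analysis at the standard view, synthesis at the virtual view."""
    # Analysis: least-squares coefficients beta such that t_n is close to T @ beta.
    beta, *_ = np.linalg.lstsq(T, t_n, rcond=None)
    # Synthesis: the same coefficients applied to the virtual-view prototypes.
    return T_r @ beta, beta

rng = np.random.default_rng(0)
num_pixels, num_prototypes = 200, 5
T = rng.normal(size=(num_pixels, num_prototypes))    # columns: prototype textures, standard view
T_r = rng.normal(size=(num_pixels, num_prototypes))  # columns: same prototypes, virtual view
true_beta = np.array([0.5, -1.0, 0.25, 2.0, 0.0])
t_n = T @ true_beta                                   # novel texture inside the prototype span

t_nr, beta = synthesize_virtual_texture(t_n, T, T_r)
```

When the columns of `T` are linearly independent, the least-squares solution coincides with applying the pseudoinverse, so `t_nr` equals `T_r @ np.linalg.pinv(T) @ t_n` up to rounding.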

Fig. 7-10 shows a set of virtual views generated using the analysis of Fig. 7-9. Note
that the prototype views must be of the same set of people across all nine views. We
used a prototype set of 55 people, so we had to specify manual correspondence (see
Fig. 6-1) for 9 views of each person to set up the shape free views. When generating
the virtual views for a particular person, we of course remove him from the
prototype set if he was initially present, following a cross validation methodology.

Figure 7-10: Example virtual views for linear classes.

Notice from Fig. 7-10 that by using the shape free textural representation, the
virtual views in this experiment are decoupled from shape and [...]
the standard shape of the virtual pose. The only difference between the views of
different people at a fixed pose will be their texture.



7.5 Experimental results 

[...] some minor changes to the optical flow step since the whole face template
[...] optical flow [...]

[...] 10 testing views per person were taken to randomly
sample poses within the overall range of poses in Fig. 7-2. Roughly half of the test
views include an image-plane rotation, so all three rotational degrees of freedom are
tested. There are 62 people in the database, including 44 males and 18 females,
people from different races, and an age range from the 20s to the 40s. Lighting for
all views is frontal and facial expression is neutral.

    interperson       prototype
    correspondence    A        B        C
    manual            84.5%    83.9%    83.9%
    auto              85.2%    84.0%    83.4%

Table 7.1: Recognition rates for parallel deformation for the different prototypes and
for manual vs. automatic features.

Table 7.1 shows recognition rates for parallel deformation for the different pro- 
totypes and for manual vs. automatic features. As with the experiments with real 
views in Chapter 4, the recognition rates were recorded for a forced choice scenario 
- the recognizer always reports the best match. In the template-based recognizer, 
template scale was fixed at an intermediate scale (interocular distance = 30 pixels) 
and preprocessing was fixed at dx+dy (the sum of separate correlations on the x and 
y components of the gradient). These parameters had yielded the best recognition 
rates for real views in Chapter 4. The results were fairly consistent, with a mean 
recognition rate of 84.1% and a standard deviation of only 0.6%. Automatic feature 
correspondence on average was as good as the manual correspondences, which was 
a good result for the face vectorizer. In the manual case, though, it is important to 
note that the manual step is at "model-building" time; the face recognizer at run 
time is still completely automatic. 
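
The summary statistics quoted above can be recomputed from the six entries of Table 7.1. The short script below does so; the dictionary layout is only an illustrative choice. The text does not say whether a population or a sample standard deviation was used, so both are computed. The mean comes out at about 84.15 and both deviations round to 0.6.

```python
from statistics import mean, pstdev, stdev

# Recognition rates (percent) from Table 7.1.
rates = {
    "manual": {"A": 84.5, "B": 83.9, "C": 83.9},
    "auto": {"A": 85.2, "B": 84.0, "C": 83.4},
}

values = [rate for row in rates.values() for rate in row.values()]
overall_mean = mean(values)
population_sd = pstdev(values)  # divides by the number of entries
sample_sd = stdev(values)       # divides by the number of entries minus one

print(f"mean = {overall_mean:.2f}%")
print(f"population sd = {population_sd:.2f}, sample sd = {sample_sd:.2f}")
```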

Fig. 7-11 summarizes our experiments with using real and virtual views in the 
recognizer. Starting on the right, we repeat the result from Chapter 4 where we 
use 15 real views per person. This recognition rate of 98.7% presents a "best case" 
scenario for virtual views. The real views case is followed by parallel deformation, 
which gives a recognition rate of 85.2% for prototype A and automatic interperson 
correspondences. Next, linear classes on texture yields a recognition rate of 73.5%. To 
put these two recognition numbers in context, we compare them to a "base" case that 
uses only two example views per person, the real view m4 plus its mirror reflection. 
A recognition rate of 70% was obtained for this two view case, thus establishing a 
lower bound for virtual views. Parallel deformation at 85% falls midway between the
benchmark cases of 70% (one view + mirror reflection) and 98% (15 views), so it shows
that virtual views do benefit pose-invariant face recognition.

In addition, the leftmost bar in Fig. 7-11 (one view) gives the recognition rate 
when only the view m4 is used. This shows how much using mirror reflection helps 
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7.6 Discussion 

7.6.1 Evaluation of recognition rate 

While the recognition rate using virtual views, ranging from 85% for parallel defor- 
mation to 73% for linear classes, is much lower than the 98% rate for the multiple 
views case, this was expected since virtual views use much less information. One way 
to evaluate these rates is to use human performance as a benchmark. To test human 
performance, one would provide a subject with a set of training images of previously 
unknown people, using only one image per person. After studying the training images, 
the subject would be asked to identify new images of the people under a variety of 
poses. Moses, Ullman, and Edelman [97] have performed this experiment using testing 
views at a variety of poses and lighting conditions. While high recognition rates were 
observed in the subjects (97%), the subjects were only asked to discriminate between 
three different people. Bruce [25] performs a similar experiment where the subject is 
asked whether a face had appeared during training, and detection rates go down to 
either 76% or 60%, depending on the amount of pose/expression difference between 
the testing and training views. Schyns and Bülthoff [116] obtain a low recognition
rate, but their results are difficult to compare since their stimuli are Gouraud shaded
3D faces that exclude texture information. Lando and Edelman [84] have recently
performed computational experiments to replicate earlier psychophysical results in
[97]. A recognition rate of only 76% was reported, but the authors suggest that this
may be improved by using a two-stage classifier instead of a single-stage one.

Direct comparison of our results to related face recognition systems is difficult 
because of differences in example and testing views. The closest systems are those 
of Lando and Edelman [84] and Maurer and von der Malsburg [93]. Both systems 
explore a view transformation method that effectively generates new views from a 
single view. The view representation, in contrast to our template-based approach, is 
feature-based: Lando and Edelman use difference of Gaussian features, and Maurer 
and von der Malsburg use a set of Gabor filters at a variety of scales and rotations 
(called "jets"). The prior knowledge Lando and Edelman used to transform faces
is similar to ours: views of prototype faces at standard and virtual views. They
average the transformation in feature space over the prototypes and apply this average 
transformation to a novel object to produce a "virtual" set of features. As mentioned 
above, they report a recognition rate of 76%. Maurer and von der Malsburg transform 
their Gabor jet features by approximating the facial surface at each feature point as 
a plane and then estimating how the Gabor jet changes as the plane rotates in 3D. 
They apply this technique to rotating faces about 45° between frontal and half-profile 
views. They report a recognition rate of 53% on a subset of 90 people from the FERET 
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Lighting 



For changes in lighting conditions, the prototype faces are fixed in pose but the 
position of the light source is changed between the standard and virtual views. Un- 
fortunately, changing the direction of the light source violates an assumption made 
for linear classes that the lighting conditions are fixed. That assumption had allowed 
us to ignore the fact that surface albedo and the local surface normal are confounded 
in the Lambertian model for image intensity. 

However, the idea of parallel deformation can still be applied. Parallel deformation 
assumes that the 3D shape of the prototype is similar to the 3D shape of the novel 
person. Thus, corresponding points on the two faces should have the same local 
surface normal. The following analysis focuses on the image brightness of the same 
feature point on both the prototype and novel face. The two feature points may have 
been brought into correspondence through a vectorization procedure. Let 



    $\eta$ = surface normal for both the prototype and novel faces
    $L_{std}$ = light source direction for standard lighting
    $L_{virtual}$ = light source direction for virtual lighting
    $\rho_{proto}$ = albedo for the prototype face
    $\rho_{nov}$ = albedo for the novel face

The prior knowledge of the lighting transformation can be represented by the ratio
of the prototype image intensities under the two lighting directions

$$\frac{\rho_{proto}(\eta \cdot L_{virtual})}{\rho_{proto}(\eta \cdot L_{std})}.$$

Simply by multiplying by the image intensity of the novel person $\rho_{nov}(\eta \cdot L_{std})$ and
cancelling terms, one can get

$$\rho_{nov}(\eta \cdot L_{virtual}),$$

which is the image intensity of the novel feature point under the virtual lighting.
Overall, the novel face texture is modulated by the changes in the prototype lighting,
an approach that has been explored by Brunelli [27].

Expression

In this case, the prototypes are fixed in pose and lighting but differ in expression,
with the standard view being, say, a neutral expression and the virtual view being a
smile, frown, etc. When generating virtual views, we need to capture both nonrigid
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at virtual pose. We demonstrated this for the grey level, or textural, component of 
the face. 

To evaluate virtual views, they were then used as example views in a view-based, 
pose-invariant face recognizer. On a database of 62 people with 10 test views per 
person, a recognition rate of 85% was achieved in experiments with parallel deforma- 
tion, which is well above the base recognition rate of 70% when only one real view 
(plus its mirror reflection) is used. Also, our recognition rate is similar to other face 
recognition experiments where extrapolation from the pose-expression range of the 
example views is tested. Overall, for the problem of generating new views of an object 
from just one view, these results demonstrate that the 2D example-based technique, 
similarly to 3D object models, may be a viable method for representing knowledge of 
object classes. 
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Chapter 8 



Discussion 



8.1 Summary 



This thesis has addressed the problem of automatic face recognition, the task of 
visually identifying a person in an input image. While this problem has been studied 
in the computer vision and pattern recognition communities for over two decades, 
most existing work in face recognition has limited the scope of the problem by dealing 
primarily with frontal views, neutral expressions, and fixed lighting conditions. To 
help generalize existing face recognition systems, this thesis has looked at the problem 
of recognizing faces under a range of viewpoints. The difficult part of this is to handle 
the two rotations out of the image plane, or rotations "in depth". In particular, we 
considered two cases of this problem: 

1. many example views are available of each person, and 

2. only one view is available per person. 

In the latter case, perhaps the single available view per person is from a driver's 
license or passport photograph. 

8.1.1 Multiple views per person 

In the multiple views case, a simple view-based approach is taken to build a pose- 
invariant face recognition system. Each person in the database is represented using 15 
views that sample a range of viewpoints on the viewing sphere. The 15 views include 5 
rotations left/right (covering the range [—30°, 30°]) and 3 rotations up/down (covering 
the range [—20°, 20°]). Each view is represented using templates of the eyes, nose, 
and mouth, which is motivated by the prior success of template-based recognition 
systems for frontal views of faces [11] [28] [59]. 

Given a new input image to recognize, the view-based recognizer follows a basic
strategy of first geometrically registering the input image with stored example views
and then using normalized correlation to evaluate the match. To perform the geometrical
registration, the [...] pose-invariant since it is [...]. To meet these
requirements, we use a [...] locating the eyes and nose,
using hundreds of templates [...] from a variety of "exemplar"
people and different [...] good match between [...]
the input image and a [...] is registered with the example view in two steps.

1. [...] the input image is resampled under an affine transform [...]
with the same features [...].

2. Then the affine-transformed input [...] view using optical flow.
[...] correspondences to register the two [...].

[...] to help provide some invariance to differences in [...]
include a rotation in the image plane, so all three [...]
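
Normalized correlation, the match score named above, can be sketched in a few lines. The version below assumes the common zero-mean form, that is, the correlation coefficient between a template and an equally sized image patch. The thesis's exact preprocessing is not reproduced, and the function name is illustrative.

```python
import numpy as np

def normalized_correlation(template, patch):
    """Zero-mean normalized correlation of two equally sized grey-level arrays."""
    a = np.asarray(template, dtype=float).ravel()
    b = np.asarray(patch, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant patch has no contrast to correlate with; report 0 in that case.
    return float(a @ b / denom) if denom > 0 else 0.0
```

Because the mean is subtracted and the result is divided by both norms, the score is unchanged when the patch is brightened or scaled in contrast, which is what makes it attractive for template matching.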






8.1.2 Single view per person 

In the second case of pose-invariant face recognition, we assume that only a single 
example view of each person is available in the database. In this case, is it still 
possible to recognize these people under a variety of poses, especially when new 
input views differ from the single available example by a rotation in depth? In 
this thesis, we reduced this case to the multiple views case by synthesizing virtual 
views [106] [110]. Virtual views are new views of an object as seen from different 
poses, lighting conditions, or expressions. For our problem of pose-invariant face 
recognition, we are interested in virtual views under different rotations in depth. 

How does one synthesize virtual views of an object? If the object belongs to a 
specific class of objects - such as the class of faces - then one may be able to take 
advantage of modeling assumptions on the class level to synthesize virtual views. That 
is, if one has prior knowledge about the object class, then one may be able to apply 
that knowledge to a single view of a "novel" object to synthesize virtual views of it. 
For instance, if one knows how faces change appearance under some transformation 
(e.g. rotation in depth, expression change), then that transformation can be applied 
to the single view available of a "novel" face. In this thesis, prior knowledge of face 
rotation in depth was encoded by using 2D views of prototype faces. Let the pose of 
the single novel object be defined as standard pose, and let the pose of the desired 
virtual view be virtual pose. Then the required views of the prototype faces are at 
both standard and virtual poses. 

Two techniques were presented for synthesizing virtual views. First, the tech- 
nique of parallel deformation (also see [108]) maps a transformation observed on a 
prototype object onto the novel object. The prior knowledge of the transformation is 
represented by a 2D deformation of feature locations on the prototype face. This 2D 
deformation records feature correspondence between the standard and virtual poses of 
the prototype, and we measure it using optical flow. Next, the prototype deformation 
is mapped onto the novel object, which requires feature correspondence between the 
standard poses of the prototype and novel objects. In this thesis, we explored both 
manual and automatic techniques for establishing these feature correspondences, the 
latter of which is based on our face "vectorizer". Finally, the mapped prototype de- 
formation is used to 2D warp the novel object to synthesize the virtual view. Overall, 
the accuracy of the virtual view depends on the degree to which the 3D shapes of the 
prototype and novel objects match. 
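
The core of parallel deformation can be illustrated on sparse feature points. The sketch below is a simplification under stated assumptions: feature locations are given as (K, 2) arrays, row k of every array refers to the same facial feature (so the interperson correspondence is already solved), and the final 2D image warp is omitted. The function and argument names are hypothetical.

```python
import numpy as np

def parallel_deformation(novel_std, proto_std, proto_virtual):
    """Map the prototype's 2D deformation onto the novel face's features.

    All arguments are (K, 2) arrays of feature locations; row k of each array
    is assumed to refer to the same facial feature.
    """
    proto_deformation = proto_virtual - proto_std  # measured on the prototype
    return novel_std + proto_deformation           # predicted novel features at virtual pose

novel_std = np.array([[10.0, 20.0], [30.0, 20.0]])
proto_std = np.array([[12.0, 22.0], [32.0, 22.0]])
proto_virtual = np.array([[15.0, 22.0], [33.0, 22.0]])
novel_virtual = parallel_deformation(novel_std, proto_std, proto_virtual)
```

In this toy example the prototype's two features move right by 3 and 1 pixels, and the novel face's features are displaced by the same amounts.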

The second technique for synthesizing virtual views uses the concept of linear 
classes (see [106][110]). The main idea behind linear classes is that the novel object 
can be decomposed as a linear combination of a set of prototype objects. This de- 
composition is performed separately for object shape (locations of a set of feature 
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recognition, this may be helpful for increasing the number of training examples 
in a learning-from-examples framework. 

8.2.2 Secondary contributions 

The secondary contributions stem from the feature finding requirements of the view- 
based recognizer and our techniques for synthesizing virtual views. 

1. Person- and pose-independent feature finder. We developed a template-based 
system for automatically locating the two irises and a nose feature. The system 
works under a range of rotation angles both in and out of the image plane. While 
it was used in this thesis as a preliminary step in the face recognizer, it could 
also be used for applications like human-computer interaction or low-bandwidth 
videoconferencing, if only to initialize a tracking system. 

2. Face "vectorizer". The face vectorizer is an automatic and person-independent 
technique for (i) locating a dense set of facial features, and (ii) modeling the 
grey levels of the face as a linear combination of prototype faces. The vectorizer 
works by exploiting a positive feedback loop between the feature correspondence 
process and the grey-level texture model which uses linear combinations. While 
the primary use of the vectorizer is to find feature correspondence between two 
arbitrary faces, the representation returned by the vectorizer is fairly general 
and could be used for tasks such as feature detection, expression analysis, and 
face recognition. 

8.3 Future work 

8.3.1 Analysis by synthesis 

In dealing with varying pose, our face recognizer uses a simple view-based approach 
that stores many example views per person. Recognizing an input view boils down 
to retrieving an example view that is a close match. One problematic feature of 
this approach happens when one tries to recognize an input view whose pose falls 
midway between the poses of the closest example views. Our current system tries to 
compensate for this by resampling the input under an affine transform and warping 
the result using optical flow. The latter optical flow step, however, is ad hoc and 
could be improved. 

A cleaner but slightly more complicated approach for the pose-invariant recognition
problem is analysis by synthesis. The basic idea behind this approach is to try
to resynthesize the input view using a face synthesis module. On an abstract level,
one can think of the face synthesis module as a parameterized face model

$$M(p) = \text{face-model}(p)$$
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vector p. Pseudocode for a generalized version of the algorithm is as follows 

algorithm: analysis-by-synthesis 

input: image I

output: parameter vector p 

(1) p = initial estimate 
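
The remaining steps of the pseudocode are not reproduced here. As a rough illustration only, one generic way to realize an analysis-by-synthesis loop is a local search that repeatedly perturbs the parameter vector and keeps changes that reduce the squared error between the synthesized and input images. Everything in the sketch below (the `face_model` callable, the coordinate search, `step`, and `n_iter`) is a hypothetical choice and not the algorithm of the thesis.

```python
import numpy as np

def analysis_by_synthesis(image, face_model, p0, step=0.5, n_iter=60):
    """Search for parameters p whose synthesized image best matches the input."""
    p = np.asarray(p0, dtype=float)
    err = np.sum((face_model(p) - image) ** 2)
    for _ in range(n_iter):
        improved = False
        for k in range(p.size):
            for delta in (step, -step):
                q = p.copy()
                q[k] += delta
                e = np.sum((face_model(q) - image) ** 2)
                if e < err:            # keep the perturbation only if it helps
                    p, err, improved = q, e, True
        if not improved:
            step /= 2                  # refine the search once no move helps
    return p, err

# Toy model: the "image" is a weighted sum of two fixed basis images.
basis_a = np.array([[1.0, 0.0], [0.0, 0.0]])
basis_b = np.array([[0.0, 1.0], [0.0, 0.0]])
toy_model = lambda p: p[0] * basis_a + p[1] * basis_b
target = toy_model(np.array([0.3, -0.7]))
p_hat, final_err = analysis_by_synthesis(target, toy_model, p0=[0.0, 0.0])
```

A real system would replace `toy_model` with a view interpolation module and would likely use a gradient-based optimizer, but the structure (synthesize, compare, update) stays the same.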
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Figure 8-1: In our proposed analysis-synthesis system for varying pose, we interpolate 
between the 15 example views of each person. The 15 views are divided into 8 cells 
of 4 views each, and $p = (r_1, r_2)$.

This analysis by synthesis approach can be applied in three different areas of this 
thesis. 

1. Recognition using multiple views. This is the scenario where all 15 views per 
person are available in the database. The current view-based system compares 
an input against a particular person by iteratively matching the input against 
9 of the 15 views. The new analysis-synthesis system would first interpolate 
the views to try to reconstruct the input. Then this reconstruction would be 
correlated against the input. 

2. Multi-view vectorizer. The current vectorizer is tuned to the particular out-of- 
plane rotation of the prototype "training" views. It should be possible to link 
a set of vectorizers as shown in Fig. 8-1 by defining correspondence between 
the standard shapes of vectorizers in the same cell. In this new multi-view 
vectorizer, the parameter vector $p$ would not only include $(r_1, r_2)$, but also the
linear texture coefficients from the vectorizer, $\beta_j$. The similarity transform $P$
and correspondences $y_{n-std}$ would be auxiliary variables. For correspondence,
there would be a set of standard shapes, one standard shape per cell. 

3. Recognition using one view. This is the scenario addressed by the second part 
of the thesis, where only one view of each person is available in the database.
Instead of synthesizing a set of virtual views off-line before recognition, a single 
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assumption that the 3D shape vector of the input Y and the 3D texture vector T 
are linear combinations of the shapes and textures of prototype faces. Under certain 
conditions, the linear coefficients $(\alpha_j, \beta_j)$ of the 3D decomposition are computable
from an arbitrary 2D view. Thus, the coefficients should be invariant to pose since
they are derived from a 3D representation. It follows that the $(\alpha_j, \beta_j)$ coefficients
should themselves be an effective representation for faces. The coefficients of the
unidentified input view $(\alpha_j, \beta_j)_{input}$ can be directly matched against the database
coefficients of each person at standard pose $(\alpha_j, \beta_j)_{std}$. Note that the linear coefficients
are not a true invariant because the recognizer at run-time needs to have an estimate 
of the out-of-plane image rotation of the input. 

8.4 Closing remarks 

This thesis has studied the problem of recognizing faces under varying pose. By 
addressing the issue of pose, our immediate goal was to help bring face recognition 
one step closer to the ultimate goal of recognition under general imaging conditions. 
Beyond pose, the other major sources of variation that need to be addressed are 
lighting conditions and expressions. But even if building a system that handles all 
these sources of variation proves elusive, there are still some applications where it is 
safe to assume restricted imaging conditions (e.g. verification for building access). 

Beyond recognizing faces, there are many interesting tasks that involve processing 
images of faces. Take, for example, the problem of detecting faces in cluttered scenes 
or the problem of estimating facial parameters such as pose, expression, mouth
articulation, and lighting conditions. Solutions to these problems will be useful in
applications like model-based coding for low-bandwidth videoconferencing, human- 
computer interaction, and performance-driven animation systems. 

The success of our view-based face recognizer not only has an impact on the study
of faces, but also lends some computational support to the use of the view-based
approach in object recognition. Our experimental results supplement recent support for
the view-based approach from psychophysics (Bülthoff, Edelman, and Tarr [29]),
neurophysiology (Logothetis and Pauls [88]), and object recognition experiments (Poggio
and Edelman [107]).

Within the larger context of object recognition, this thesis has addressed a dis- 
crimination task using a view-based approach. The recognition strategy developed 
in this thesis may be useful in other subordinate-level recognition tasks, such as the 
identification of animals like dogs and cats, or the identification of cars. One area 
that has not been addressed is the more general problem of basic-level recognition or 
categorization. The view-based approach may be useful here as well, but so may other
approaches such as parts-based representations or 3D models. More study certainly
needs to be done in this area, and there is probably not just one answer. In tackling
[...] approaches; we would be surprised if the view-based approach were not one of them.



Appendix A 
Face Database 



The face database contains 62 people, 25 views per person. So that we can explore 
the issue of pose, the different views of each person cover a variety of poses, including 
rotation angles both in and out of the image plane. 

The database is divided into two parts, a set of example views and a set of testing 
views. 

1. Example views. In our view-based approach for face recognition under varying 
pose, faces are represented using 15 example images that cover the viewing 
sphere. Shown in Fig. A-1, these views sample 5 left/right rotations and 3
up/down rotations. When a subject is added to the database of faces, example 
and test image data is taken with a camera perched on top of a workstation 
monitor. To help collect the example views, we fit a large piece of foam core 
around the monitor with dots indicating the viewing sphere locations being 
sampled. When taking the example views, the subject is asked to rotate his 
head to point his nose at each of the 15 dots. No mechanisms are used to make 
the subject's poses accurate relative to the ideal "dot" poses other than our
oral instructions fine tuning the subject's pose. This field of dots samples the 5
left/right rotations at approximately -30, -15, 0, 15, and 30 degrees and the 3 
up/down rotations at approximately -20, 0, and 20 degrees. The two rotation 
parameters are restricted so that the two eyes are always visible; this is why 
the left/right rotation parameter is not sampled beyond 30 degrees. 

2. Test views. In addition to the 15 example views, 10 test views are taken per 
person. For these test views, the subject is instructed to choose 10 points at 
random within the rectangle defined by the outer border of dots. The test poses 
can fall close to example poses or in between them. The 10 views are divided 
into two groups of 5. The first group is similar to the example views in that 
only the left/right and up/down rotational parameters are allowed to vary. For 
the second group of 5, the subject is allowed to introduce image-plane rotation. 
See Fig. A-2 for example test views.

Figure A-1: The view-based face recognizer uses 15 example views per person.




Figure A-2: For each person, 10 test images are taken that sample random poses from
the viewing sphere.

We currently have 62 people in the database for a total of 930 example and 620
testing views. The selection of people is fairly varied, including 44 males and 18
females, people from different races, and an age range from the 20s to the 40s. The
frontal views of everyone in the database are shown in Figs. A-3 and A-4.

For the example and test views, the lighting conditions are fixed and consist
of a 60 watt lamp near the camera supplemented by background [...]
and overhead lights. Facial expression is also fixed at a neutral expression.

After taking the example and test images, we manually specify the locations of
the two irises, nose lobes, and corners of the mouth (see Fig. A-5). These manual
feature locations are used for four purposes:

1. During batch evaluations of the feature finder, they serve as ground truth data 
for validating the locations returned by the feature finder.







Figure A-3: An exhaustive listing of people in the database, part 1 of 2. 
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Figure A-5: The irises, nose lobes, and corners of the mouth are manually labeled for 
each image in the database.

2. Also in the feature finder, the manual locations define the exact (x, y) locations
of the irises and nose lobe features within the eyes-nose templates. As
explained in Chapter 3, these (x, y) locations are mapped to the input image
using correspondences from optical flow in order to locate the irises and nose
lobes in the input image.

3. For the recognizer itself, the feature locations are used to automatically define 
the bounding boxes of facial feature templates in the example images, as is 
discussed in Chapter 4. 

4. Lastly, during the geometrical alignment step between input and example images,
the recognizer registers the automatically located input features to the
manually located example image features.
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Appendix B 

Linear Classes: Shape and Texture 



As explained in section 7.3.1, linear classes is a technique for synthesizing new views of 
an object using views of prototypical objects belonging to the same object class. The 
basic idea is to decompose the novel object as a linear combination of the prototype 
objects. This decomposition is performed separately for the shape and texture of 
the novel object. In this appendix, we explain the mathematical detail behind the 
linear class approach for shape and texture. Please refer to sections 7.2 and 7.3.1 for 
definitions of the example prototype images, mathematical operators, etc. 



B.1 Shape

In this section, we reformulate the description of linear classes for shape that originally
appeared in Poggio and Vetter [110]. The development here makes explicit the fact
that the vectorized $y$ vectors need not be in correspondence between the standard
and virtual poses.

Linear classes begins with the assumption that a novel object is a linear combination
of a set of prototype objects in 3D

$$Y_n = \sum_{j=1}^{N} \alpha_j Y_{p_j}. \tag{B.1}$$

From this assumption, it is easy to see that any 2D view of the novel object will be
the same linear combination of the corresponding 2D views of the prototypes. That
is, the 3D linear decomposition is the same as the 2D linear decomposition. Using
equation (7.2) which relates 3D and 2D shape vectors, let $y_{n,r}$ be a 2D view of a novel
object

$$y_{n,r} = L Y_n \tag{B.2}$$

and let $y_{p_j,r}$ be 2D views of the prototypes

$$y_{p_j,r} = L Y_{p_j}, \quad 1 \le j \le N. \tag{B.3}$$

Apply the operator $L$ to both sides of equation (B.1)

$$L Y_n = L\Big(\sum_{j=1}^{N} \alpha_j Y_{p_j}\Big). \tag{B.4}$$

We can bring $L$ inside the sum since $L$ is linear

$$L Y_n = \sum_{j=1}^{N} \alpha_j L Y_{p_j}. \tag{B.5}$$

Substituting equations (B.2) and (B.3) yields

$$y_{n,r} = \sum_{j=1}^{N} \alpha_j y_{p_j,r}.$$

Thus, the 2D linear decomposition uses the same set of linear coefficients as with the
3D vectorization.

Next, we show that under certain assumptions, the novel object can be analyzed
at standard pose and the virtual view synthesized at virtual pose using a single set
of linear coefficients. Again, assume that a novel object is a linear combination of a
set of prototype objects in 3D

$$Y_n = \sum_{j=1}^{N} \alpha_j Y_{p_j}. \tag{B.6}$$

Say that we have 2D views of the prototypes at standard pose $y_{p_j}$, 2D views of the
prototypes at virtual pose $y_{p_j,r}$, and a 2D view of the novel object $y_n$ at standard
pose. Additionally, assume that the 2D views $y_{p_j}$ are linearly independent. Project
both sides of equation (B.6) using the rotation for standard pose, yielding

$$y_n = \sum_{j=1}^{N} \alpha_j y_{p_j}.$$

A unique solution for the $\alpha_j$ exists since the $y_{p_j}$ are linearly independent. Now, since
we have solved for the same set of coefficients in the 3D linear class assumption, the
decomposition at virtual pose must use the same coefficients

$$y_{n,r} = \sum_{j=1}^{N} \alpha_j y_{p_j,r}.$$

That is, we can recover the $\alpha_j$'s from the view at standard pose and use the $\alpha_j$'s to
generate the virtual view of the novel object.
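
The argument above can be checked numerically on synthetic data. The sketch below builds random 3D "prototypes", forms a novel object as their linear combination, recovers the coefficients from the standard-pose view alone, and compares the synthesized virtual view with the true projection. The orthographic projection after a rotation about the vertical axis, the random point sets, and all names are illustrative assumptions.

```python
import numpy as np

def projection(yaw_deg):
    """2x3 orthographic projection after rotating about the vertical axis."""
    a = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(a), 0.0, np.sin(a)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(a), 0.0, np.cos(a)]])
    return R[:2]  # keep x and y, drop depth

def project(points3d, P):
    """Apply the linear operator to every 3D point and stack into one vector."""
    return (points3d @ P.T).ravel()

rng = np.random.default_rng(1)
K, N = 40, 4                                  # feature points, prototypes
prototypes = rng.normal(size=(N, K, 3))
alpha_true = np.array([0.4, 0.3, 0.2, 0.1])
novel = np.tensordot(alpha_true, prototypes, axes=1)  # 3D linear class assumption

L_std, L_virt = projection(0.0), projection(30.0)
Y_std = np.stack([project(p, L_std) for p in prototypes], axis=1)    # columns: standard pose
Y_virt = np.stack([project(p, L_virt) for p in prototypes], axis=1)  # columns: virtual pose

# Analysis at standard pose, then synthesis at virtual pose with the same coefficients.
alpha, *_ = np.linalg.lstsq(Y_std, project(novel, L_std), rcond=None)
virtual_view = Y_virt @ alpha
```

Since the novel object lies exactly in the span of the prototypes here, `virtual_view` matches `project(novel, L_virt)` up to rounding; with real faces the linear class assumption holds only approximately.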

B.2 Texture 

Virtually the same argument can be applied to the geometrically normalized texture
vectors $t$. The idea of applying linear classes to texture was thought of by the author
and independently by Vetter and Poggio [132].

With the texture case, assume that a novel object texture $T_n$ is a linear combination
of a set of prototype textures

$$T_n = \sum_{j=1}^{N} \beta_j T_{p_j}. \tag{B.7}$$

As with shape, we show that the 3D linear decomposition is the same as the 2D linear
decomposition. Using equation (7.3) which relates 3D and 2D texture vectors, let $t_{n,r}$
be a 2D texture of a novel object

$$t_{n,r} = D T_n \tag{B.8}$$

and let $t_{p_j,r}$ be 2D textures of the prototypes

$$t_{p_j,r} = D T_{p_j}, \quad 1 \le j \le N. \tag{B.9}$$

Apply the operator $D$ to both sides of equation (B.7)

$$D T_n = D\Big(\sum_{j=1}^{N} \beta_j T_{p_j}\Big). \tag{B.10}$$

We can bring $D$ inside the sum since $D$ is linear

$$D T_n = \sum_{j=1}^{N} \beta_j D T_{p_j}. \tag{B.11}$$

Substituting equations (B.8) and (B.9) yields

$$t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r}.$$

Thus, as with shape, the 2D linear decomposition for texture uses the same set of
linear coefficients as with the 3D vectorization.

Next, we show that under certain linear independence assumptions, the novel
object texture can be analyzed at standard pose and the virtual view synthesized at
virtual pose using a single set of linear coefficients. Again, assume that a novel object
texture $T_n$ is a linear combination of a set of prototype objects

$$T_n = \sum_{j=1}^{N} \beta_j T_{p_j}. \tag{B.12}$$

Say that we have 2D textures of the prototypes at standard pose $t_{p_j}$, the 2D prototype
textures at virtual pose $t_{p_j,r}$, and a 2D texture of the novel object at standard pose
$t_n$. Additionally, assume that the 2D textures $t_{p_j}$ are linearly independent. Project
both sides of equation (B.12) using the rotation for standard pose, yielding

$$t_n = \sum_{j=1}^{N} \beta_j t_{p_j}.$$

A unique solution for the $\beta_j$ exists since the $t_{p_j}$ are linearly independent. Now, since
we have solved for the same set of coefficients in the 3D linear class assumption, the
decomposition at virtual pose must use the same coefficients

$$t_{n,r} = \sum_{j=1}^{N} \beta_j t_{p_j,r}.$$

That is, we can recover the $\beta_j$'s from the view at standard pose and use the $\beta_j$'s to
generate the virtual view of the novel object.
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